System for evaluating large language models using multi-dataset benchmarking and performance analysis modules
The system addresses the limitations of existing evaluation methods by managing multiple datasets in parallel, synchronizing execution, adapting resource allocation, and ensuring tamper-proof storage, resulting in a more reliable and efficient evaluation of large language models.
Patent Information
- Authority / Receiving Office
- DE · DE
- Patent Type
- Utility models
- Current Assignee / Owner
- MOTANI
- Filing Date
- 2026-05-07
- Publication Date
- 2026-07-02
AI Technical Summary
Existing evaluation methods for large language models lack the capability to execute multiple heterogeneous datasets in parallel, synchronize execution times, adaptively allocate resources, correlate quality and resource metrics with input data characteristics, and ensure tamper-proof storage, leading to inconsistent and untraceable results.
A system that manages multiple heterogeneous datasets in parallel, synchronizes execution via a central clock, adaptively allocates benchmarking processors across heterogeneous computing units, captures quality and resource metrics, and stores results with cryptographic checksums to ensure reproducibility and integrity.
Enables technically reproducible, hardware-aware, and dataset-spanning analysis of large language models, with comparable measurements across datasets, balanced resource utilization, and tamper-proof storage of evaluation results.
Abstract
Description
Technical field
The present invention relates to the technical field of computer-aided evaluation, analysis, and comparability of large language models. In particular, the invention relates to a system for the hardware-based, time-synchronized, and resource-conscious evaluation of large language models using multiple heterogeneous evaluation datasets. The invention further lies in the fields of artificial intelligence, natural language processing, model validation, performance benchmarking, hardware telemetry, reproducible evaluation of computer-aided models, and tamper-proof storage of evaluation results.
More specifically, the invention relates to a technical system architecture in which several evaluation datasets are processed in parallel, benchmarking processors are adaptively assigned to heterogeneous computing units, quality metrics and resource metrics are recorded synchronously and correlated with each other, and the resulting evaluation results are stored in a traceable manner with cryptographic checksums and timestamps.
State of the art
Large language models are increasingly used in technical, scientific, economic, and industrial applications, such as automated text generation, information processing, decision support, software development, document analysis, and human-machine interaction. With the increasing prevalence of such models, the need for objective, reproducible, and technically robust evaluation methods is also growing. Such methods are used to verify the performance, reliability, response quality, and operational efficiency of large language models under varying operating conditions.
Common evaluation approaches often rely on single benchmark datasets or a limited number of standardized test tasks.
Typically, inputs are fed into a language model under test, the resulting outputs are compared with reference responses or rating scales, and then quality metrics such as accuracy, agreement, response consistency, or task fulfillment are determined. While such approaches allow for a basic assessment of model performance, they often fail to adequately consider that large language models can exhibit highly variable performance profiles depending on the dataset type, input length, input modality, difficulty level, and task structure. Benchmarking environments are also known in which multiple evaluation datasets are executed sequentially. However, such sequential execution has the disadvantage that the evaluation conditions can diverge over time. Changes in hardware utilization, memory usage, temperature, background processes, or power consumption can influence the measured performance values. This leads to discrepancies that make a direct comparison of the results of different datasets difficult. Furthermore, systems for measuring the performance of computational models are known that record individual hardware parameters such as processor utilization, memory usage, or response latency. However, these systems are often not designed to precisely link the recorded resource metrics with the model-related quality metrics and the specific input data characteristics in a temporally accurate manner. As a result, it often remains unclear whether reduced model performance was caused by a specific data set structure, a particular input characteristic, or a hardware-related resource limitation. Another problem with established evaluation approaches is that they typically rely on a fixed mapping between evaluation tasks and computing resources. 
However, large language models can be executed or supported on various heterogeneous computing units, with individual units differing significantly in terms of processing power, memory architecture, energy consumption, latency, and parallelization capabilities. A rigid mapping of evaluation processes to computing units can therefore lead to unbalanced load distributions, distorted measurement results, and inefficient resource utilization. Furthermore, well-known benchmarking systems often lack sufficient technical reproducibility. With large language models, even minor changes to selection parameters, random number generators, processing core selection, model weights, hardware configurations, or rounding modes can lead to different outputs. If such parameters are not controlled and documented, evaluation results can only be partially understood or replicated later. Furthermore, a problem with known evaluation environments is that the stored results are not sufficiently protected against manipulation. While results can be stored in databases or log files, a technical safeguard that clearly demonstrates which model weights, datasets, hardware configurations, and timestamps underlie a specific evaluation is often lacking. Such traceability is of considerable importance, particularly in scientific validation, regulatory review, model certification, or commercial comparative evaluation. There is therefore a need for an improved technical system for evaluating large language models that can execute multiple heterogeneous evaluation datasets in parallel and in a time-synchronized manner, adaptively account for heterogeneous computing units, capture quality and resource metrics in a time-synchronized manner, correlate these metrics with input data characteristics, and store the evaluation results in a tamper-proof manner. The present invention addresses these technical problems. 
Object of the invention
The object of the present invention is therefore to provide a system for evaluating large language models that overcomes the aforementioned disadvantages and in particular enables: simultaneous, time-synchronized evaluation across a large number of heterogeneous datasets; time-synchronous acquisition of quality, performance, and resource metrics; adaptive, dataset- and hardware-specific load distribution on heterogeneous computing units; automated correlation analysis between input data characteristics and model behavior; and integrity-assured, reproducible persistence of the evaluation results.
Summary of the invention
The present invention provides a system for evaluating large language models that enables a technically reproducible, hardware-aware analysis of model performance that is comparable across datasets. The system is specifically designed to manage and execute multiple heterogeneous evaluation datasets in parallel, to synchronize execution over time, to adaptively incorporate heterogeneous computing units, and to capture both model-related quality metrics and hardware-related resource metrics within a common evaluation context.
The system includes a dataset management unit that manages at least two different evaluation datasets and generates a dataset profile for each. This dataset profile can contain, for example, dataset-specific parameters such as input length, task type, input modality, difficulty index, expected response structure, or other technical characteristics of the respective evaluation dataset. This profiling enables evaluation runs to be analyzed not merely as isolated tests, but in conjunction with the technical characteristics of the input data.
Furthermore, the system includes a multi-dataset benchmarking module with several benchmarking processors that can be executed in parallel.
Each benchmarking processor is assigned to an evaluation dataset and its timing is synchronized via a central synchronization unit based on a common hardware clock reference signal. This allows evaluation tasks to be performed across different datasets under comparable timing conditions. This improves the technical comparability of response times, processing latencies, resource consumption values, and quality metrics. The system also features a hardware abstraction layer that detects at least two heterogeneous computing units and adaptively distributes the benchmarking processors across these units. This allocation is based on the data set profile, a model profile, and current hardware telemetry. This prevents evaluation processes from being rigidly or randomly assigned computing resources. Instead, the system can utilize available computing resources dynamically and in a technically transparent manner, enabling a more balanced load distribution and a more accurate assessment of model performance. A performance analysis module captures model-related quality metrics as well as hardware-related resource metrics. Quality metrics can include, for example, response accuracy, consistency, error rate, task resolution, or semantic agreement. Resource metrics can encompass hardware load, memory usage, processing latency, energy consumption, or other values captured by hardware-related meters. A correlation unit links these quality and resource metrics with the input data characteristics, enabling the technical determination of dataset- and input-dependent performance deviations of the large language model. In one embodiment, the system includes a deterministic control mechanism configured to define pseudo-random number generators, selection parameters, arithmetic kernel selection, and floating-point rounding modes. This improves the repeatability of evaluation runs and reduces the probability of uncontrolled deviations in results. 
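As a rough illustration of the deterministic control described above, the following Python sketch fixes a set of run parameters, derives a digest for them, and shows that a seeded pseudo-random number generator yields the same sampling order on every run. All field names and values are assumptions made for illustration; in particular, kernel selection and floating-point rounding modes are only recorded here, since enforcing them depends on the inference stack actually used.

```python
import hashlib
import json
import random


def build_parameter_set(seed: int) -> dict:
    """Collect parameters to be fixed before a run (all field names are assumed)."""
    return {
        "prng_seed": seed,
        "temperature": 0.0,                        # selection parameter, illustrative value
        "top_p": 1.0,                              # probability limit, illustrative value
        "kernel_selection": "fixed",               # recorded only, not enforced here
        "float_rounding_mode": "round-half-even",  # recorded only, not enforced here
    }


def parameter_set_digest(parameter_set: dict) -> str:
    """Hash a canonical JSON form so the same parameters give the same digest."""
    canonical = json.dumps(parameter_set, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def sample_order(parameter_set: dict, n_items: int) -> list[int]:
    """Shuffle item indices with a private, seeded generator."""
    rng = random.Random(parameter_set["prng_seed"])
    order = list(range(n_items))
    rng.shuffle(order)
    return order


params = build_parameter_set(seed=1234)
first_order = sample_order(params, 10)
second_order = sample_order(params, 10)
```

Using a private `random.Random` instance instead of the module-level generator keeps the run independent of any other code that draws random numbers in the same process.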
Such repeatability is particularly important for large language models, as even minor changes to the execution conditions can lead to different model outputs.
The invention further comprises a persistence layer that stores evaluation results together with cryptographic checksums of the model weights used, dataset checksums, hardware configurations, and timestamps. In a preferred embodiment, the persistence layer can include a cryptographic checksum chain, thereby making subsequent changes to stored evaluation results detectable. In this way, increased traceability, integrity, and reliability of the evaluation results are achieved.
By combining parallel multi-dataset evaluation, hardware-based time synchronization, adaptive resource allocation, time-synchronized quality and resource analysis, and tamper-proof storage, the invention provides an improved technical system that supports the objective evaluation of large language models. The invention is particularly suitable for research institutions, testing laboratories, model providers, companies, certification bodies, and technical evaluation platforms that need to analyze large language models under reproducible and hardware-aware conditions.
Detailed description of the invention
The present invention is described in more detail below with reference to preferred embodiments. The described embodiments serve to illustrate the technical teaching and are not to be understood as limiting the scope of protection. Individual features of the described embodiments can be combined, exchanged, or replaced with technically equivalent features, provided that the basic function of the system for evaluating large language models is thereby maintained.
The system according to the invention is designed for evaluating large language models and enables the parallel, time-synchronized, and hardware-aware execution of multiple evaluation runs.
In contrast to conventional evaluation environments, where datasets are often processed sequentially and results are stored without precise technical mapping to hardware states, the present system provides a structured technical architecture in which dataset profiles, model profiles, hardware telemetry, quality metrics, resource metrics, and integrity information are processed together.
The system includes a dataset management unit that manages at least two heterogeneous evaluation datasets simultaneously. Heterogeneous evaluation datasets are defined as those that differ in task type, input structure, input length, input modality, language domain, difficulty level, answer format, or evaluation logic. For example, a first evaluation dataset might contain short question-and-answer tasks, while a second evaluation dataset might contain long legal, technical, or scientific text inputs. A further evaluation dataset might contain multi-stage reasoning tasks, programming tasks, dialogue situations, or domain-specific subject-matter questions.
For each evaluation dataset, the dataset management unit generates a dataset profile. This profile contains dataset-specific parameters used for the technical planning and analysis of the evaluation run. These parameters can include, in particular, an average input length, a maximum input length, an average output length, an input modality, a task class, an expected response structure, a difficulty index, a domain identifier, a token distribution, a dataset size, memory requirements, and an expected processing load. The dataset profile can be automatically calculated from the respective evaluation dataset or supplemented by user input.
The system also includes a multi-dataset benchmarking module, which provides multiple benchmarking processors that can be executed in parallel. Each benchmarking processor is assigned to a specific evaluation dataset or a section of an evaluation dataset.
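The automatic profile calculation described above might look roughly like the following Python sketch. It covers only a few of the listed parameters, and the class name, field names, and the choice of whitespace-separated tokens as the length unit are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DatasetProfile:
    """Hypothetical subset of the dataset-specific parameters named in the text."""
    name: str
    task_class: str
    dataset_size: int
    avg_input_length: float
    max_input_length: int


def build_profile(name: str, task_class: str, inputs: list[str]) -> DatasetProfile:
    """Derive a profile from raw inputs; length is counted in whitespace tokens."""
    if not inputs:
        raise ValueError("an evaluation dataset needs at least one input")
    lengths = [len(text.split()) for text in inputs]
    return DatasetProfile(
        name=name,
        task_class=task_class,
        dataset_size=len(inputs),
        avg_input_length=mean(lengths),
        max_input_length=max(lengths),
    )


# Made-up example inputs, five whitespace tokens each.
short_qa = build_profile(
    "short_qa",
    "question_answering",
    ["What is 2 + 2?", "Name the capital of France."],
)
```

A production profile would more plausibly count tokens with the tokenizer of the model under test, since memory and latency scale with model tokens rather than words.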
The benchmarking processors handle the technical control of the model queries, input processing, output recording, temporary storage of response data, and the transfer of evaluation information to the performance analysis module. Parallel execution allows the system to evaluate multiple evaluation datasets under comparable operating conditions. The benchmarking processors are synchronized by a central synchronization unit. This unit uses a common hardware clock reference signal to coordinate the start times, measurement windows, response time measurements, and acquisition intervals of the benchmarking processors. The hardware clock reference signal can be provided by an internal time source, a clock reference generated close to the processor, a high-resolution hardware clock, or another suitable time synchronization device. In a preferred embodiment, the clock reference signal has a temporal resolution of at most one millisecond. Hardware-based synchronization allows latency, energy consumption, memory access, and utilization values from different evaluation runs to be correlated. This is particularly advantageous when multiple datasets are processed in parallel and hardware conditions change during execution. The synchronization unit can assign a time index or timestamp to each measurement, enabling subsequent correlation between model responses, input data, and hardware states. The system also includes a hardware abstraction layer. This hardware abstraction layer is designed to detect at least two heterogeneous computing units and capture their technical characteristics. Examples of such units include general-purpose computing units, graphics-based computing units, tensor-based computing units, neural processing units, field-programmable accelerators, or other hardware-accelerated computing units. 
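The coordinated start and time indexing performed by the synchronization unit, as described above, can be approximated in software as sketched below. This is only a stand-in under stated assumptions: a thread barrier and the operating system's monotonic clock replace the common hardware clock reference signal, so the sketch does not guarantee the one-millisecond resolution named in the text. Function and variable names are hypothetical.

```python
import threading
import time


def run_synchronized(dataset_names: list[str]) -> dict[str, int]:
    """Release one worker per dataset together and record each start offset in ns."""
    barrier = threading.Barrier(len(dataset_names))
    reference_ns = time.monotonic_ns()  # stand-in for the common clock reference
    start_offsets_ns: dict[str, int] = {}
    lock = threading.Lock()

    def worker(name: str) -> None:
        barrier.wait()  # blocks until every worker has arrived
        offset = time.monotonic_ns() - reference_ns
        with lock:
            start_offsets_ns[name] = offset  # time index relative to the reference

    threads = [threading.Thread(target=worker, args=(n,)) for n in dataset_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return start_offsets_ns


offsets = run_synchronized(["short_qa", "long_legal", "code_tasks"])
```

Because every measurement is expressed as an offset from the same reference, values recorded by different workers can later be placed on one timeline.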
The hardware abstraction layer can capture technical characteristics such as available processing cores, memory size, memory bandwidth, power state, current utilization, temperature, supported computational formats, parallelization capability, and latency behavior. Based on the collected hardware information, the hardware abstraction layer adaptively assigns the benchmarking processors to the available computing units. This assignment is not static, but rather takes into account the data set profile, a model profile, and current hardware telemetry. The model profile can contain technical information about the large language model being evaluated, such as model size, number of parameters, context window size, memory requirements, preferred computation format, inference mode, expected computational load, response length parameters, and supported execution environments. In one embodiment, adaptive allocation is achieved via an allocation matrix. This matrix contains weighted values that combine dataset-specific characteristics, model-specific profile parameters, and current hardware telemetry values. For example, an evaluation dataset with long inputs and a high expected memory load can be allocated to a processing unit with larger available memory, while a dataset with many short inputs is allocated to a processing unit with particularly low latency. The allocation matrix can be updated during an ongoing evaluation run if the hardware telemetry changes significantly. The system includes a performance analysis module designed for the combined collection and evaluation of quality and resource metrics. This module features a quality metric unit that captures model-related evaluation metrics. 
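One way to picture the allocation matrix described above is as a table of weighted fit scores, one per pairing of dataset and computing unit. The Python sketch below is a simplified assumption, not the claimed method: the scoring formula, the normalized 0-to-1 feature values, the weights, and all names are invented for illustration.

```python
def score(dataset: dict, unit: dict, weights: dict) -> float:
    """Weighted fit of one dataset to one computing unit (higher is better)."""
    return (
        weights["memory"] * dataset["memory_demand"] * unit["free_memory"]
        + weights["latency"] * dataset["latency_sensitivity"] * unit["speed"]
        - weights["load"] * unit["utilization"]
    )


def allocation_matrix(datasets: dict, units: dict, weights: dict) -> dict:
    """Build a dataset-by-unit table of scores."""
    return {
        d_name: {u_name: score(d, u, weights) for u_name, u in units.items()}
        for d_name, d in datasets.items()
    }


def assign(matrix: dict) -> dict:
    """Pick the best-scoring unit for each dataset independently."""
    return {d_name: max(row, key=row.get) for d_name, row in matrix.items()}


# Illustrative, made-up profile and telemetry values normalized to 0..1.
datasets = {
    "long_legal": {"memory_demand": 0.9, "latency_sensitivity": 0.2},
    "short_qa": {"memory_demand": 0.1, "latency_sensitivity": 0.9},
}
units = {
    "gpu": {"free_memory": 0.9, "speed": 0.5, "utilization": 0.3},
    "npu": {"free_memory": 0.3, "speed": 0.9, "utilization": 0.1},
}
weights = {"memory": 1.0, "latency": 1.0, "load": 0.5}

matrix = allocation_matrix(datasets, units, weights)
assignment = assign(matrix)
```

With these values the long-input dataset lands on the unit with more free memory and the short-input dataset on the faster unit. Recomputing the matrix when telemetry changes would correspond to the update during an ongoing run; a fuller version would also prevent several datasets from piling onto one unit.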
Depending on the evaluation task, these quality metrics can encompass various values, such as response accuracy, semantic agreement, completeness, consistency, error rate, task resolution rate, hallucination rate, response stability, relevance, language quality, or agreement with predefined reference responses. In addition to the quality metrics unit, the performance analysis module includes a resource metrics unit. This unit records hardware-related resource metrics synchronously with the evaluation runs. These resource metrics include, in particular, hardware load, memory usage, memory bandwidth, processing latency, response time, energy consumption, temperature, compute unit utilization, waiting time, data transfer time, and the number of hardware-related operations. Data can be collected via generic hardware monitoring interfaces, power meters, energy consumption meters, memory access counters, or processor utilization counters. A key feature of the system is that resource metrics are not recorded in isolation, but rather linked to quality metrics and input data characteristics. For this purpose, the performance analysis module includes a correlation unit. This unit connects the quality scores generated during an evaluation run with the corresponding resource scores and the characteristics of the respective input. In this way, it can be determined whether specific input types, input lengths, difficulty levels, or task classes lead to increased resource consumption, longer response times, or reduced response quality. In another embodiment, the correlation unit includes a classification module. This classification module can categorize erroneous model outputs into error classes. Such error classes can include, for example, factual errors, incomplete responses, irrelevant responses, contradictory outputs, format violations, calculation errors, source errors, or response terminations. 
The error classes can then be correlated with input data characteristics and resource metrics. This reveals whether certain error types occur more frequently under specific hardware conditions or with particular dataset types. The system can also include a deterministic control mechanism. This mechanism is designed to define technical parameters to ensure that evaluation runs are executed as reproducibly as possible. To this end, the deterministic control can fix pseudo-random number generators, selection parameters, temperature parameters, probability limits, processor selection, processing order, floating-point rounding modes, and other output-relevant parameters. This technical control reduces uncontrolled deviations between multiple evaluation runs. Deterministic control can generate an execution set before the start of an evaluation run, containing all deterministic parameters relevant to that run. This execution set can be saved along with the evaluation results. This makes it possible to trace the technical conditions under which a particular result was generated. This is especially important for scientific comparative studies, certification procedures, model releases, and internal quality assurance. To store the results, the system includes a persistence layer. This layer stores the evaluation results along with accompanying technical information. This information includes, in particular, cryptographic checksums of the model weights used, data set checksums, hardware configurations, timestamps, data set profiles, model profiles, allocation information, and selected telemetry parameters. The cryptographic checksums enable integrity verification of the stored results and the underlying technical data. In a preferred embodiment, the persistence layer incorporates a cryptographic checksum chain. Each stored result record is cryptographically linked to a previous result record. 
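A cryptographic checksum chain of this kind can be sketched in a few lines of Python. The record layout, the all-zero starting value, and the use of SHA-256 over canonical JSON are assumptions for illustration; the text does not specify a hash function or storage format, and the payload values are made up.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed anchor value preceding the first record


def record_digest(payload: dict, previous_digest: str) -> str:
    """Hash the previous digest together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(previous_digest.encode("ascii") + body).hexdigest()


def append_record(chain: list[dict], payload: dict) -> None:
    """Link a new result record to the digest of the record before it."""
    previous = chain[-1]["digest"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "previous": previous,
        "digest": record_digest(payload, previous),
    })


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every digest; any edited record breaks the chain."""
    previous = GENESIS
    for entry in chain:
        if entry["previous"] != previous:
            return False
        if entry["digest"] != record_digest(entry["payload"], previous):
            return False
        previous = entry["digest"]
    return True


chain: list[dict] = []
append_record(chain, {"dataset": "short_qa", "accuracy": 0.91})
append_record(chain, {"dataset": "long_legal", "accuracy": 0.84})
intact = verify_chain(chain)

chain[0]["payload"]["accuracy"] = 0.99  # simulated after-the-fact modification
tampering_detected = not verify_chain(chain)
```

In a fuller version each payload would also carry the checksums of the model weights and datasets, the hardware configuration, and the timestamp, so that those values are covered by the chain as well.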
Any subsequent modification of a single result record would thus alter the checksum chain and become detectable. This tamper detection is particularly advantageous when evaluation results are used for external audits, comparative certificates, public rankings, industrial approvals, or regulatory documentation. The system can also include a graphical display unit. This unit is designed for the real-time presentation of evaluation results, quality metrics, resource metrics, and correlation information. For example, the display unit can show how response quality, latency, memory usage, and energy consumption behave across different datasets. Similarly, the display unit can show error classes, hardware utilization, dataset profiles, and changes in the allocation matrix. This enables immediate technical monitoring of the ongoing evaluation process. In a typical operation, the data set management unit loads multiple evaluation data sets and generates a data set profile for each one. The hardware abstraction layer then identifies the available processing units and their current telemetry data. Based on the data set profiles, the model profile, and the hardware telemetry, the system creates an allocation matrix. Finally, the benchmarking processors are assigned to the respective processing units and started in a coordinated manner via the central synchronization unit. During execution, the benchmarking processors send inputs to the large language model and capture the generated outputs. Simultaneously, the resource metrics unit captures hardware-level measurements within synchronized measurement windows. The quality metrics unit evaluates the model outputs based on the evaluation logic defined for the respective dataset. The correlation unit then links the quality values with the resource values and the input data characteristics. The results are subsequently stored in the persistence layer along with cryptographic checksums, timestamps, and technical configuration data. 
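The linking step performed by the correlation unit can be illustrated with a plain Pearson correlation between an input characteristic and the recorded metrics. This is a minimal sketch under assumptions: the text does not name a correlation measure, the field names are hypothetical, and the four samples are made-up values used only to exercise the code.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up samples: one input characteristic, one resource metric, one quality metric.
samples = [
    {"input_tokens": 100, "latency_ms": 110.0, "correct": 1},
    {"input_tokens": 200, "latency_ms": 205.0, "correct": 1},
    {"input_tokens": 400, "latency_ms": 390.0, "correct": 1},
    {"input_tokens": 800, "latency_ms": 810.0, "correct": 0},
]

tokens = [float(s["input_tokens"]) for s in samples]
latency = [s["latency_ms"] for s in samples]
correct = [float(s["correct"]) for s in samples]

length_vs_latency = pearson(tokens, latency)
length_vs_quality = pearson(tokens, correct)
```

In this toy data, longer inputs go along with higher latency and lower correctness, which is the kind of input-dependent deviation the correlation unit is meant to surface. Real runs would need far more samples before such a coefficient means anything.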
The invention thus enables an evaluation of large language models that considers not only the quality of the generated responses but also the technical costs of these responses. For example, a model may exhibit high response quality on a first dataset but cause disproportionately high energy consumption or high latency. Another model may operate faster and more efficiently with certain input types but have a higher error rate. The correlation of these values according to the invention enables a significantly more differentiated technical evaluation. The architecture according to the invention is suitable for various operating environments. It can be used in local data centers, testing laboratories, enterprise environments, cloud infrastructures, university research systems, or certification platforms. The system can be implemented as a standalone device, as a server system, as a distributed computing platform, or as part of a model evaluation platform. The individual units can be implemented in hardware, software, or a combination of hardware and software, provided that the claimed technical functions are provided. The advantages of the invention lie particularly in the improved comparability of evaluation runs, the adaptive use of heterogeneous computing resources, the time-synchronous acquisition of quality and resource metrics, the technical correlation of model performance and input properties, the increased reproducibility through deterministic control, and the tamper-proof storage of results. This provides a technical system that significantly improves the objective, traceable, and resource-conscious evaluation of large language models.
Claims
1. System for evaluating large language models, comprising: a dataset management unit configured to simultaneously manage at least two heterogeneous evaluation datasets and to generate a dataset profile with dataset-specific parameters for each evaluation dataset; a multi-dataset benchmarking module comprising a plurality of parallel-executable benchmarking processors, each benchmarking processor being assigned to an evaluation dataset and time-synchronized via a central synchronization unit based on a common hardware clock reference signal; a hardware abstraction layer configured to detect at least two heterogeneous computing units and to adaptively assign the benchmarking processors to the heterogeneous computing units, the assignment being based on the dataset profile, a model profile, and current hardware telemetry; a performance analysis module with a quality metrics unit for capturing model-related evaluation metrics, a resource metrics unit for time-synchronous acquisition of hardware load, memory usage, latency, and energy consumption via hardware-related meters, and a correlation unit configured to link the quality metrics and resource metrics with input data characteristics; and a persistence layer for storing the evaluation results along with a cryptographic checksum of the model weights used, dataset checksums, hardware configurations, and a timestamp.
2. System according to claim 1, characterized in that the central synchronization unit uses a hardware-based clock reference signal whose temporal resolution is at most one millisecond, wherein the temporal coordination of the parallel-executable benchmarking processors enables a comparable acquisition of model response times, processing latencies, and resource consumption values across the at least two heterogeneous evaluation datasets.
3. System according to claim 1 or 2, characterized in that the hardware abstraction layer supports at least two different types of computing units from a group comprising a general computing unit, a graphics-based computing unit, a tensor-based computing unit, a neural processing unit, and a field-programmable accelerator unit, wherein the adaptive allocation of the benchmarking processors is based on an allocation matrix in which dataset-specific characteristics, model-specific profile parameters, and current hardware telemetry values are weighted and combined.
4. System according to claim 1, characterized in that the resource metrics unit is configured to acquire hardware-related measured values via generic hardware monitoring interfaces, energy consumption meters, power meters, memory access meters, or processor utilization meters, wherein the acquired resource metrics are linked with the quality metrics and the input data characteristics by the correlation unit to determine dataset-dependent performance deviations of the large language model.
5. System according to claim 1, characterized in that the system further comprises a determinism control and a graphical display unit, wherein the determinism control is configured for the deterministic fixing of pseudo-random number generators, selection parameters, computational kernel selection, and floating-point rounding modes, and wherein the graphical display unit is configured for the real-time display of the correlation between quality metrics, resource metrics, input data characteristics, and error classes of faulty model outputs.