A one-click tuning method and device for server BIOS
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHUOXIN (TIANJIN) INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2026-05-29
- Publication Date
- 2026-06-30
AI Technical Summary
Existing server BIOS tuning solutions cannot perform security verification based on real-time hardware status before parameters are issued, nor can they continuously and dynamically optimize performance and power consumption balance based on actual load after parameters are issued, and they lack automatic hierarchical protection against hardware anomalies after tuning.
By creating a tuning template library and utilizing the mapping relationship between platform identifiers and CPU models, candidate tuning templates are automatically filtered, conflict detection is performed based on real-time hardware status, the optimal tuning template is determined, and BIOS parameter configuration is adjusted.
It automates and secures server BIOS parameter configuration, improves the convenience and accuracy of tuning, reduces the complexity of manual configuration and the risk of misoperation, ensures hardware security, and adapts to compatibility management across different CPU platforms.
Figure CN122309221A_ABST
Description
Technical Field
[0001] This application relates to the field of server performance tuning technology, and in particular to a one-click tuning method and apparatus for server BIOS.
Background Technology
[0002] With the rapid development of technologies such as cloud computing, big data, and high-performance computing, servers have become the core computing infrastructure, and server performance optimization has become a key means of improving computing resource utilization and reducing data center operating costs. In a typical cloud data center or artificial intelligence training cluster, the same rack may simultaneously hold servers from different manufacturers and generations, each equipped with a different model of central processing unit (CPU). These heterogeneous servers differ significantly in hardware configuration. The Basic Input/Output System (BIOS), as the firmware layer connecting the operating system and the underlying hardware, directly affects core indicators such as CPU performance, power consumption limits, and virtualization capabilities, and therefore plays a decisive role in server operating efficiency.
[0003] Currently, the industry primarily employs the following techniques for server BIOS tuning. First, the parameter traversal method based on simulation models: hardware simulation models are constructed for the CPU, memory, GPU, etc., candidate parameter combinations are traversed, and their performance is scored to select the optimal configuration. Second, the manual configuration method: system administrators adjust dozens of parameters in the BIOS settings interface based on their experience, including CPU performance state, configurable thermal design power, multi-threading on/off, and the I/O memory management unit. Third, the preset mode selection method: server manufacturers pre-install several fixed parameter combinations in the BIOS firmware, forming preset modes such as high performance and energy saving, which users can select and activate.
[0004] However, the aforementioned existing technologies all have limitations in practical applications. First, they do not establish a mapping relationship between platform identifiers and CPU models, making it impossible to automatically select suitable tuning templates based on the CPU model of the server to be tuned. Different CPU platforms have different register address spaces and feature sets. If users select templates manually, they may choose incompatible configuration schemes due to a lack of understanding of hardware specifications, leading to register write errors or system instability. Second, the parameter traversal method based on simulation models suffers from complex modeling and large deviations. Simulation models are approximate abstractions of hardware behavior, and it is difficult for them to accurately reproduce the dynamic characteristics of subsystems such as CPU, memory, and I/O operating together under real business loads. This results in a deviation between simulation scores and actual operating performance, and the generated parameter sets have poor compatibility. At the same time, the number of candidate parameter combinations is huge and the traversal process is extremely time-consuming, making it difficult to meet the timeliness requirements of rapid deployment and elastic scaling in data centers. Third, although manual configuration avoids simulation deviations, it requires a high level of hardware knowledge from operations and maintenance administrators. There are complex dependencies and mutual exclusion relationships among dozens of parameters, so manual configuration is prone to errors, and tracing the root cause of an incorrect configuration is time-consuming and labor-intensive.
Fourth, the preset mode selection method is the most convenient, but the parameter combinations it provides are static and cannot be adjusted dynamically according to the real-time operating status of the server. If a user selects the high-performance mode in a high-temperature environment, the CPU may overheat and crash. In addition, all three existing technologies lack the ability to perceive the actual operating status of the hardware and to predict configuration conflicts. They cannot identify and block high-risk parameter combinations before the configuration takes effect; after the parameters are issued, they do not continue to optimize according to load changes and cannot achieve a real-time balance between performance and power consumption; and they lack automatic protection measures for abnormal hardware conditions after tuning. Once the tuning parameters cause overheating or power consumption beyond the limit, the system can only wait passively for manual intervention and cannot automatically take protective actions such as degradation or rollback.
Summary of the Invention
[0005] In view of this, this application provides a one-click server BIOS tuning method and apparatus to solve the problems of existing server BIOS tuning schemes being unable to perform security verification based on real-time hardware status before parameter issuance, unable to continuously and dynamically optimize performance and power consumption balance based on actual load after parameter issuance, and lacking automatic hierarchical protection for hardware anomalies after tuning.
[0006] Specifically, this application is implemented through the following technical solution:
[0007] The first aspect of this application provides a one-click BIOS tuning method for a server, the method comprising:
[0008] Create a tuning template library, which includes multiple tuning templates. Each tuning template has a unique platform identifier, and the platform identifier has a mapping relationship with the CPU model.
[0009] Based on the CPU model of the server to be tuned and the mapping relationship, a set of candidate tuning templates is determined at the platform layer of the tuning template library;
[0010] The target tuning template is determined from the candidate tuning template set based on the scenario instructions;
[0011] Based on the real-time hardware status of the server to be tuned, conflict detection is performed on the target tuning template, and the optimal tuning template is determined based on the conflict detection results.
[0012] The optimal tuning template parameters are sent to the server to be tuned to complete the BIOS parameter configuration adjustment.
[0013] A second aspect of this application provides a one-click server BIOS tuning device, the device comprising a creation module, a determination module, a detection module, and a processing module;
[0014] The creation module is used to create a tuning template library, which includes multiple tuning templates. Each tuning template has a unique platform identifier, and the platform identifier and CPU model have a mapping relationship.
[0015] The determination module is used to determine a set of candidate tuning templates at the platform layer of the tuning template library based on the CPU model of the server to be tuned and the mapping relationship;
[0016] The determination module is further configured to determine the target tuning template from the candidate tuning template set based on the scenario instructions;
[0017] The detection module is used to perform conflict detection on the target tuning template based on the real-time hardware status of the server to be tuned, and to determine the optimal tuning template based on the conflict detection results.
[0018] The processing module is used to send the optimal tuning template parameters to the server to be tuned to complete the BIOS parameter configuration adjustment.
[0019] The server BIOS one-click tuning method and apparatus provided in this application offer the following benefits. On the one hand, integrating multiple types of tuning templates in the template library, from which a suitable template is selected adaptively for the server being tuned, improves the convenience and efficiency of tuning and lowers the technical experience required. On the other hand, to preserve tuning accuracy while improving convenience, scenario matching and conflict detection against real-time hardware status ensure that the template accurately matches the tuning environment and the tuned object. The mapping relationship between platform identifiers and CPU models lets the template library manage compatibility across multiple hardware platforms. Furthermore, a conflict detection mechanism based on real-time hardware status is introduced before parameter issuance, making security verification a mandatory step in the tuning process. Overall, server BIOS parameter configuration becomes automated and secure, reducing the complexity of manual configuration and the risk of misoperation. By creating a tuning template library with unique platform identifiers and establishing a mapping relationship between platform identifiers and CPU models, different CPU platforms can share the same template management architecture, which solves the problem in existing solutions of templates being bound to one platform and not reusable across platforms. Based on the CPU model and the mapping relationship, a set of candidate tuning templates is determined at the platform level, and the system automatically selects the available templates based on hardware characteristics, avoiding the security risk of users selecting incompatible templates because they are unfamiliar with hardware specifications.
The target tuning template is determined from the candidate set based on scenario instructions, and register parameters are encapsulated into a single scenario selection operation, allowing users to make tuning decisions without understanding the meaning of underlying parameters. Conflict detection is performed on the target template based on real-time hardware status, and the optimal tuning template is determined accordingly. High-risk parameter combinations that may endanger hardware security are detected and blocked before the parameters take effect, making up for the technical deficiency of static templates that cannot perceive the current operating status of the server.
Attached Figure Description
[0020] Figure 1 is a flowchart of Embodiment 1 of the server BIOS one-click tuning method provided in this application;
[0021] Figure 2 is a schematic diagram of Embodiment 2 of the server BIOS one-click tuning device provided in this application.
Detailed Implementation
[0022] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.
[0023] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used herein are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
[0024] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to determining."
[0025] The following specific embodiments are given to illustrate the technical solution of this application in detail.
[0026] Embodiment 1
[0027] Figure 1 is a flowchart of Embodiment 1 of the server BIOS one-click tuning method provided in this application. Referring to Figure 1, the method provided in this embodiment may include:
[0028] S101. Create a tuning template library, which includes multiple tuning templates. Each tuning template has a unique platform identifier, and the platform identifier and CPU model have a mapping relationship.
[0029] It should be noted that the tuning template library is used to store and manage BIOS parameter configuration schemes for different CPU platforms and application scenarios. Each tuning template in the library has a unique platform identifier, which is mapped to the CPU model. Specifically, the platform identifier is generated based on the CPU's CPUID information. The CPUID is a set of identification registers inside the CPU, and its return value includes fields such as manufacturer identifier, family, model, and stepping. In the x86 architecture, the complete CPU identity can be obtained by executing the CPUID instruction with different input parameters; for example, EAX=1 returns the Family / Model / Stepping information. In the ARM architecture, similar information is obtained by reading the Main ID Register (MIDR). The platform identifier is generated from the aforementioned CPUID fields according to preset encoding rules, for example by concatenating the manufacturer identifier, family, model, and stepping fields and then performing a hash operation to obtain a unique identifier value.
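As an illustration of the encoding rule described above, the following is a minimal Python sketch under stated assumptions. The function names, the field separator, the choice of SHA-256, and the 16-character truncation are all hypothetical and are not specified by this application. The bit layout used for decoding follows the commonly documented x86 CPUID leaf 1 EAX format, with the extended-model rule as documented for Intel processors, and the sample value 0x000806F8 is only an example signature.

```python
import hashlib


def decode_x86_signature(eax: int) -> tuple[int, int, int]:
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended family is added only when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # Intel-documented rule: the extended model applies for base family 6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


def platform_identifier(vendor: str, family: int, model: int, stepping: int) -> str:
    """Concatenate the CPUID fields and hash them (assumed encoding rule)."""
    encoded = f"{vendor}:{family:02x}:{model:02x}:{stepping:x}".encode("ascii")
    return hashlib.sha256(encoded).hexdigest()[:16]


family, model, stepping = decode_x86_signature(0x000806F8)
pid = platform_identifier("GenuineIntel", family, model, stepping)
print(family, hex(model), stepping, pid)
```

Because the stepping is part of the hashed string in this sketch, two steppings of the same model would receive different identifiers; an implementation that wants one identifier per CPU platform might omit or mask the stepping before hashing.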
[0030] Each tuning template has a unique platform identifier. Different tuning templates belonging to the same CPU platform share the same platform identifier, while tuning templates for different CPU platforms have different platform identifiers. Through this mapping relationship, the BIOS or its server system can automatically locate a compatible set of templates based on the CPU model of the server to be tuned in subsequent steps.
[0031] Optionally, the tuning template library is stored in the BIOS's non-volatile random access memory (NVRAM) or in the Baseboard Management Controller (BMC) flash memory. The BMC is a management controller that operates independently of the server's main processor and can monitor and manage the server hardware through an out-of-band management channel.
[0032] The construction of the tuning template library covers the hierarchical inheritance structure of the templates and the template generation method. First, the tuning template library adopts a four-level hierarchical template structure, from top to bottom: root template, platform template, scenario template, and custom template. Adjacent layers are related by inheritance; a lower-level template inherits all parameter definitions from its upper-level template and can then extend or modify them accordingly.
[0033] At the database storage level, the four-level hierarchical template structure is stored using a hierarchical indexing method. The root template, serving as the top-level index, is the least numerous, typically containing only one or a few sets of basic parameter definitions. All platform templates are associated with the root template through a root template identifier. Platform templates use the platform identifier as the primary index, with each platform template corresponding to a CPU platform type. Multiple scene templates are associated under the same platform template, with the number of scene templates exceeding the number of platform templates. Scene templates use scene type codes as secondary indexes, and multiple custom templates can be associated under each scene template, with the number of custom templates exceeding the number of scene templates, forming a tree-like storage structure that expands layer by layer from top to bottom.
[0034] During template retrieval, the system first determines the corresponding platform identifier based on the CPU model of the server to be optimized. The system then directly locates the corresponding platform template using this identifier. All scene templates and custom templates under the same platform identifier are stored in the same storage branch, allowing for rapid location of the candidate template set via indexing. Custom templates employ a copy-on-write strategy, storing only the parameter items that differ from the inherited scene template. Unmodified parameter items are retrieved through index references pointing to the inherited scene template, avoiding the complete copying of all parameter data for each custom template.
[0035] The root template defines the most basic register operation permissions and security boundaries, including the CPU register read / write permission mask, and the upper and lower security limits for parameter values. For example, the maximum value of Configurable Thermal Design Power (cTDP) cannot exceed the power limit defined in the CPU hardware specification. Once this constraint is defined in the root template, all lower-level templates must comply with it, thus eliminating the possibility of exceeding the limit at the architectural level.
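A minimal sketch of how a root-level safety boundary might be enforced on any lower-level template is shown below. The parameter name `ctdp_watts` and the limit values are placeholders chosen for illustration; they are not taken from any CPU hardware specification or from this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    low: int
    high: int


# Hypothetical root-template boundaries; real values would come from the CPU specification.
ROOT_LIMITS = {"ctdp_watts": Limit(low=150, high=350)}


def check_against_root(params: dict[str, int]) -> list[str]:
    """Return a description of every parameter that violates a root limit."""
    violations = []
    for name, value in params.items():
        limit = ROOT_LIMITS.get(name)
        if limit is not None and not (limit.low <= value <= limit.high):
            violations.append(f"{name}={value} outside [{limit.low}, {limit.high}]")
    return violations


print(check_against_root({"ctdp_watts": 400}))
print(check_against_root({"ctdp_watts": 300}))
```

Running the same check when any platform, scenario, or custom template is saved or loaded is one way to make the root constraint binding on every level below it.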
[0036] The platform template inherits all constraints from the root template and populates the default register address table for a specific CPU platform. Because different CPU platforms (such as Intel Sapphire Rapids, AMD Genoa, etc.) have different register address spaces, the platform template records the correct register addresses for each platform, such as the P-State request register and cTDP upper limit register, to ensure correct addressing when subsequent parameters are issued. Simultaneously, the platform template carries a CompatibilityHeader, which contains the Platform ID and CPU Signature Mask, used for matching and verification with the current CPU's CPUID during template loading. The CPUID is the CPU's identifier register, containing information such as manufacturer, series, model, and stepping. By comparing the CPUID with the template's CPU Signature Mask, it can be determined whether the template is suitable for the current CPU platform. The platform template provides a general parameter base applicable to all scenarios under the same CPU type. However, because different application scenarios have different emphases on performance, power consumption, virtualization, and other indicators, relying solely on the general configuration of the platform template cannot achieve optimal performance in specific scenarios, resulting in issues of uneven accuracy across scenarios and the inability to perform targeted optimization. This problem is solved by lower-level scene templates and custom templates. While inheriting all parameters from the platform templates, the scene templates optimize the parameters for specific scenarios such as high performance, energy saving, and virtualization, thus making up for the lack of scene generalization of the platform templates.
[0037] Scenario templates inherit from platform templates and set parameter values for specific application scenarios. Typical scenarios may include high-performance, energy-saving, and virtualization scenarios. The differences between scenarios are reflected in the specific parameter values. For example, in high-performance scenarios, P-State is set to P0 (the highest frequency of all cores) and the cTDP limit is manually increased to unleash performance potential. In virtualization scenarios, Input-Output Memory Management Unit (IOMMU) and Single Root I/O Virtualization (SR-IOV) support are enabled to improve device pass-through performance in virtualized environments. In their parameter configuration, scenario templates distinguish between CPU-related parameters and scenario-related parameters. CPU-related parameters are inherited from the platform template and left unchanged, ensuring that platform-level constraints such as register addresses and hardware security boundaries are not affected by scenario switching. Scenario-related parameters are optimized separately for each scenario. For example, in high-performance scenarios, P-State and cTDP are adjusted to a combination that maximizes performance, while in energy-saving scenarios they are adjusted to a combination that optimizes energy efficiency. In this way, when users switch between scenarios, the underlying security configuration related to the CPU platform remains unchanged and only the scenario-related performance parameters are adjusted as needed, balancing security and scenario optimization. It should be noted that the scenario template is the object that users directly select when performing one-click tuning.
[0038] Custom templates, inheriting from a scene template, allow users to modify non-critical parameters based on the scene template. A copy-on-write (COW) strategy is used to store the modified parameter items. COW means that when a user needs to modify a parameter, the system does not directly overwrite the data in the original scene template. Instead, it saves only the modified parameter item and its modified value in the custom template's storage space, while unmodified parameter items continue to use the default values from the inherited scene template. This strategy saves storage space by avoiding copying all parameter data for each user template and also maintains the inheritance relationship between templates, facilitating subsequent maintenance and auditing.
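The copy-on-write behaviour described above can be illustrated with Python's standard `collections.ChainMap`, which sends writes to its first mapping and resolves reads through the remaining mappings. This is only an analogy for the storage strategy; the parameter names and values are hypothetical.

```python
from collections import ChainMap

# Parameters of the inherited scenario template (hypothetical values).
scenario_defaults = {"p_state": "P0", "ctdp_watts": 300, "iommu": False}

# The custom template starts empty: nothing is copied from the scenario template.
custom_overrides: dict[str, object] = {}
custom_view = ChainMap(custom_overrides, scenario_defaults)

# A write lands only in the custom template's own storage.
custom_view["iommu"] = True

print(custom_overrides)          # only the modified item is stored
print(custom_view["p_state"])    # unmodified items resolve to the scenario template
print(scenario_defaults["iommu"])  # the scenario template itself is untouched
```

Because the custom template stores only its differences, a later change to an unmodified default in the scenario template is visible through every custom template that inherits from it.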
[0039] At the data structure level, each tuning template consists of three main parts: a template header, platform compatibility information, and a parameter combination list. The template header includes a template identifier and a description. The template identifier, composed of a platform identifier, scenario type code, and version number, uniquely identifies a tuning template. The description records the applicable scenarios and key parameter configuration points for the template. The platform compatibility information includes a platform identifier and a CPU signature mask. The platform identifier is used for mapping and matching with CPU models, and the CPU signature mask defines the manufacturer identifier, family, model, and stepping range of the CPUs compatible with the template. The parameter combination list is an array of parameter entries. Each parameter entry includes a register address, parameter name, current parameter value, default value, parameter value range, and parameter type field. The parameter type identifies whether the parameter is critical or non-critical. Critical parameters are inherited from upper-level templates and cannot be modified by lower-level templates, while non-critical parameters can be adjusted within lower-level templates.
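The three-part template layout and the critical / non-critical distinction could be modelled roughly as follows. All class names, field names, register addresses, and values here are hypothetical; the sketch only shows one way the described fields fit together and how a lower-level modification of a critical parameter might be rejected.

```python
from dataclasses import dataclass, field


@dataclass
class ParameterEntry:
    register_address: int
    name: str
    value: int
    default: int
    value_range: tuple[int, int]
    critical: bool  # critical entries may not be changed by lower-level templates


@dataclass
class TuningTemplate:
    platform_id: str
    scenario_code: str
    version: int
    description: str
    cpu_signature_mask: int
    parameters: list[ParameterEntry] = field(default_factory=list)

    @property
    def template_id(self) -> str:
        # Template identifier = platform identifier + scenario type code + version.
        return f"{self.platform_id}-{self.scenario_code}-v{self.version}"

    def override(self, name: str, value: int) -> None:
        """Apply a lower-level modification to one non-critical parameter."""
        for entry in self.parameters:
            if entry.name == name:
                if entry.critical:
                    raise PermissionError(f"{name} is critical and cannot be modified")
                low, high = entry.value_range
                if not low <= value <= high:
                    raise ValueError(f"{name}={value} outside [{low}, {high}]")
                entry.value = value
                return
        raise KeyError(name)


tpl = TuningTemplate(
    platform_id="plat-a", scenario_code="hp", version=1,
    description="high-performance scenario (illustrative)",
    cpu_signature_mask=0x0FFF0FF0,
    parameters=[
        ParameterEntry(0x1000, "ctdp_limit", 300, 280, (150, 350), critical=True),
        ParameterEntry(0x1004, "determinism_slider", 0, 0, (0, 1), critical=False),
    ],
)
tpl.override("determinism_slider", 1)
print(tpl.template_id, tpl.parameters[1].value)
```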
[0040] The four-level hierarchical template structure above defines the logical relationships and inheritance rules between templates, but the parameter values that each scenario template needs in order to achieve optimal performance in its scenario must be determined by analyzing operating data from real hardware. Once the structure is designed, the specific parameter values filled into each scenario template are therefore determined through offline benchmark testing and multiple linear regression fitting.
[0041] This embodiment uses offline benchmark testing instead of simulation models to determine the parameter values of the scene template. Compared with the existing technology of building simulation models of CPU, memory, etc. for traversal scoring, offline benchmark testing is based on real hardware operation data, avoiding the problem that simulation models are difficult to accurately simulate the dynamic characteristics of hardware collaborative operation. The generated template parameters are more adaptable to actual needs.
[0042] Specifically, the creation of the tuning template library includes:
[0043] (1) Determine the platform template inherited by the scene template and define the scene optimization target.
[0044] Before generating a scenario template, the platform template that the scenario template inherits from is first determined. The scenario template inherits all CPU-related parameters (including register address tables, hardware security boundaries, etc.) from the specified platform template to ensure that platform-level constraints are not violated during scenario optimization. Then, scenario optimization goals are defined according to business requirements. For example, high-performance scenarios aim to maximize the performance-to-power ratio, energy-saving scenarios aim to minimize power consumption while keeping performance at or above a preset baseline, and virtualization scenarios aim to maximize the performance-to-power ratio while ensuring that IOMMU and SR-IOV are enabled.
[0045] (2) Run a standard load on the corresponding CPU platform and record the performance data under different parameter combinations.
[0046] A standard workload is run on the CPU platform corresponding to the platform template inherited by the scenario template. The standard workload may include the SPECpower benchmark for evaluating the overall energy efficiency of the server, the STREAM benchmark for evaluating memory bandwidth, and the Flexible I/O Tester (FIO) benchmark for evaluating storage I/O performance. During the run, performance and power consumption data are recorded simultaneously under different combinations of CPU performance state (P-State) level, cTDP setting, and Determinism Slider setting (the Determinism Slider is a CPU performance determinism control that governs the stability and fluctuation range of the CPU frequency), forming a raw data matrix. Here, P-State represents the CPU performance state level, with P0 representing the highest performance state (highest frequency) and Pn representing the lowest performance state (lowest frequency, maximum power saving); cTDP is the configurable thermal design power, which allows the processor's power consumption limit to be adjusted within the CPU hardware specifications.
[0047] (3) Use a regression algorithm to calculate the weight coefficient of each parameter with respect to the scenario optimization target.
[0048] The raw data matrix is processed using a multiple linear regression algorithm to obtain the weight coefficient of each BIOS parameter with respect to the scenario optimization target indicator (such as performance per watt). Multiple linear regression is a statistical analysis method used to establish a linear model relating multiple independent variables to a dependent variable. The independent variables are adjustable parameters such as the P-State level, cTDP value, and Determinism Slider value, while the dependent variable is the scenario optimization target indicator. The weight coefficients obtained from the regression reflect the relative contribution of each parameter to the scenario optimization target.
[0049] (4) Sort and filter the parameter combinations based on the weight coefficients to determine the optimal parameter set for the scenario.
[0050] The parameter combinations are sorted according to their weighting coefficients, and the parameter values that contribute the most to the optimization goal of the scenario are selected as the optimal parameter values for that scenario. Based on this, the remaining parameters are then fine-tuned to ensure optimal overall performance of the parameter combinations within the given optimal parameter values. For example, in a high-performance scenario, after the P-State with the highest weighting coefficient is determined to be P0 (the highest frequency across all cores), the cTDP value is adjusted to the minimum upper limit that can support the continuous operation of P0, thus forming the optimal parameter set for that scenario.
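Steps (3) and (4) above can be sketched in pure Python as follows. The data are synthetic: the target values are generated from a made-up linear formula purely so that the fit can be checked, and they are not benchmark measurements. Min-max scaling of the inputs is an assumption added here so that the fitted coefficients are comparable across parameters with different units; the application does not specify a scaling step.

```python
from itertools import product


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_weights(rows, targets):
    """Least-squares fit of target ~ intercept + sum(w_i * x_i) via normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


NAMES = ["p_state", "ctdp", "determinism"]
# Every combination of P-State level, cTDP (watts), and Determinism Slider setting.
raw = [list(r) for r in product((0, 1, 2), (200, 250, 300), (0, 1))]
# Synthetic target (stand-in for performance per watt), NOT measured data.
targets = [10 - 2 * p + 0.05 * c + 1 * d for p, c, d in raw]

coeffs = fit_weights(min_max_scale(raw), targets)
weights = dict(zip(NAMES, coeffs[1:]))
# Rank parameters by the magnitude of their contribution to the target.
ranking = sorted(NAMES, key=lambda name: abs(weights[name]), reverse=True)
print(weights, ranking)
```

With real benchmark data the relationship is unlikely to be exactly linear, so the weights would be an approximation, and the ranking would only guide which parameter to fix first before fine-tuning the rest.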
[0051] Optionally, the base template and the generated scene template are stored in a structured binary format, including fields such as Template ID, description information, compatibility tag header, number of parameters, and parameter entry array. A Cyclic Redundancy Check (CRC32) checksum is calculated to ensure storage integrity. CRC32 is a commonly used data integrity verification algorithm that generates a checksum by performing a polynomial calculation on the data content. If the data has been illegally tampered with, the checksum will not match, thus detecting data corruption or tampering.
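A minimal sketch of a structured binary layout with a trailing CRC32 is given below, using Python's standard `struct` and `zlib` modules. The field order, field widths, and little-endian encoding are assumptions made for illustration; the application does not define the exact binary format.

```python
import struct
import zlib


def pack_template(template_id: int, description: bytes,
                  entries: list[tuple[int, int]]) -> bytes:
    """Serialize a template (assumed layout) and append a CRC32 of the payload."""
    # Header: 32-bit template id, 16-bit description length, then the description.
    payload = struct.pack("<IH", template_id, len(description)) + description
    # Parameter count followed by (register address, value) pairs.
    payload += struct.pack("<H", len(entries))
    for address, value in entries:
        payload += struct.pack("<II", address, value)
    return payload + struct.pack("<I", zlib.crc32(payload))


def verify_template(blob: bytes) -> bool:
    """Recompute the CRC32 over the payload and compare it with the stored value."""
    payload, stored = blob[:-4], struct.unpack("<I", blob[-4:])[0]
    return zlib.crc32(payload) == stored


blob = pack_template(1, b"high performance", [(0x1000, 300), (0x1004, 1)])
corrupted = bytearray(blob)
corrupted[5] ^= 0x01  # flip one bit inside the payload
print(verify_template(blob), verify_template(bytes(corrupted)))
```

A CRC32 reliably catches accidental corruption such as a flipped bit. Protection against deliberate modification would additionally need a keyed hash or digital signature, since anyone who alters the payload can also recompute the checksum.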
[0052] S102. Based on the CPU model of the server to be tuned and the mapping relationship, a set of candidate tuning templates is determined in the tuning template library platform layer.
[0053] It's important to note that after creating the tuning template library, when performing one-click BIOS tuning on a specific server, not all templates in the library should be directly presented to the user for selection. Different CPU platforms have different register address spaces, supported feature sets, and hardware constraints. Loading and distributing templates incompatible with the current platform could lead to register write errors, system instability, or even hardware damage. Therefore, it's necessary to first select a suitable set of candidate templates from the library based on the CPU model of the server to be tuned.
[0054] It should be noted that the candidate tuning template set filtered by S102 includes all scenario templates and custom templates associated with a successfully matched platform template. In the four-level hierarchical template structure, the root template serves as the underlying security constraint; its register operation permissions and security boundaries are inherited by all platform templates, so it does not require separate matching during the S102 filtering phase. The platform template carries a compatibility tag header and is the direct object matched against the CPU model. Because scenario templates and custom templates are lower-level templates of a platform template, once a platform template is successfully matched, all of its associated scenario templates and custom templates are automatically included in the candidate tuning template set. In this way, S102 filtering uses the platform template as the primary index. After a platform template is matched, all available scenario tuning solutions for that CPU platform are presented to the user at once, balancing filtering efficiency with the completeness of available solutions.
[0055] Specifically, during the boot phase, the BIOS obtains the CPU model by reading the CPUID information of the CPU in the server being tuned. The CPUID is a set of internal identification registers within the CPU, and its return value includes fields such as manufacturer identifier, serial number, model, and stepping. In the x86 architecture, the complete CPU identity can be obtained by executing the CPUID instruction and specifying different input parameters, such as EAX=1 returning Family / Model / Stepping information. In the ARM architecture, similar information is obtained by reading the Main ID Register (MIDR).
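The Family / Model / Stepping fields returned for EAX=1 can be decoded as sketched below. Python cannot execute the CPUID instruction itself, so the function takes the returned EAX value as input; the function name `decode_cpuid_eax1` is illustrative, while the bit positions follow the standard x86 signature layout.

```python
def decode_cpuid_eax1(eax):
    """Split the EAX value returned by CPUID leaf 1 into identity fields."""
    stepping = eax & 0xF              # bits 3:0
    base_model = (eax >> 4) & 0xF     # bits 7:4
    base_family = (eax >> 8) & 0xF    # bits 11:8
    ext_model = (eax >> 16) & 0xF     # bits 19:16
    ext_family = (eax >> 20) & 0xFF   # bits 27:20

    # The extended family is added only when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model supplies the high nibble for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}
```

For instance, the sample value `0x000806F8` decodes to family 0x6, model 0x8F, stepping 8.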
[0056] After obtaining the CPU's CPUID, the BIOS matches and verifies it against the compatibility tag headers of various platform templates in the tuning template library. It's important to note that the CPU model is only matched against platform templates. Scenario templates and custom templates, as lower-level templates of the platform templates, do not directly participate in CPUID matching; instead, they are indirectly associated through their respective platform templates. The platform template is the only template level in the four-level hierarchical structure that carries a compatibility tag header. Root templates, scenario templates, and custom templates do not carry compatibility tag headers; therefore, only platform templates can be matched and verified against CPU models. The compatibility tag header, as described in S101, contains two fields: platform identifier and CPU signature mask. The CPU signature mask defines the CPU Family, Model, and Stepping range compatible with the template. It is a bitmask; by performing a bitwise AND operation between the CPU's CPUID and the mask, it can be quickly determined whether the current CPU falls within the template's compatibility range. If the result matches the template's pre-stored target value, the match is successful, and the template is added to the candidate tuning template set; if none of the platform templates match, the candidate set is empty.
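The mask-and-compare step can be illustrated with a short sketch. The `PlatformTemplate` class, its field names, and the sample mask and target values are hypothetical; only the bitwise AND followed by comparison with a pre-stored target value comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformTemplate:
    name: str
    sig_mask: int                  # CPU signature mask from the compatibility tag header
    sig_target: int                # pre-stored target value for (cpuid & mask)
    scenario_templates: tuple = ()  # associated scenario and custom templates


def select_candidates(cpuid_eax, platform_templates):
    """Return the scenario templates of every platform template that matches."""
    candidates = []
    for tpl in platform_templates:
        if (cpuid_eax & tpl.sig_mask) == tpl.sig_target:
            candidates.extend(tpl.scenario_templates)
    return candidates  # an empty list means no compatible template


# Example masks that ignore the stepping bits (assumed values).
LIBRARY = [
    PlatformTemplate("platform_a", 0x0FFF0FF0, 0x000806F0,
                     ("high_performance", "energy_saving")),
    PlatformTemplate("platform_b", 0x0FFF0FF0, 0x00A00F10,
                     ("virtualization",)),
]
```

With this library, `select_candidates(0x000806F8, LIBRARY)` yields the two scenarios of `platform_a`, and an unrecognized signature yields an empty list, which corresponds to the empty candidate set.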
[0057] Furthermore, once the candidate tuning template set is determined, the BIOS can assess its validity. If the candidate set is empty, meaning no template is compatible with the current CPU platform, the BIOS will output a prompt message through the user interface, such as displaying "No available tuning templates for the current platform" in the BIOS setup interface, and terminate the one-click tuning process. If the candidate set contains only one template, that template is used directly as the target tuning template; if the candidate set contains multiple templates, such as multiple scenario templates for the same CPU platform corresponding to high performance, energy saving, virtualization, etc., the user will then select from them using scenario commands.
[0058] Optionally, when the candidate template set is presented in the BIOS settings interface, template options that are incompatible with the current platform can be grayed out and accompanied by a "This template is not applicable to this platform" message. The platform identification differences between templates can also be visualized, for example, grouped by CPU platform, to help users understand the applicable scope of each template. In batch tuning scenarios, when the management node needs to perform one-click tuning on multiple servers simultaneously, the template matching status of each node can be centrally displayed through the BMC WebUI, and nodes with empty candidate sets can be individually marked and alerted.
[0059] Optionally, the compatibility check results can be recorded in the BMC system event log (SEL), including fields such as check time, CPU model, CPUID value, list of successfully matched templates or reasons for failed matching, to facilitate subsequent operation and maintenance audits and fault tracing.
[0060] Through the steps described above, S102 completes the process of filtering candidate tuning template sets from the template library based on the CPU model. First, mandatory matching and verification between the CPUID and the compatibility tag header eliminates the possibility of writing incompatible template parameters into the CPU registers. Different CPU platforms have different register addresses and parameter definitions; if parameters are issued without verification, incorrect addresses or illegal values may be written to the registers, causing system instability or even hardware damage. Second, users do not need to determine which CPU platform their server uses or which template to select; the system completes the filtering automatically and presents only safe options, lowering both the technical threshold and the probability of error in BIOS tuning. Third, when dealing with heterogeneous server clusters containing CPUs from different manufacturers and generations, a unified template library can cover multiple platforms. This automatic matching mechanism allows the same tuning method to adapt to different hardware environments, improving operational efficiency.
[0061] S103. Determine the target tuning template from the candidate tuning template set based on the scenario instructions.
[0062] After determining the candidate tuning template set, if the candidate set contains multiple applicable templates, the user needs to select one as the target for this tuning based on actual business needs. Scenario instructions are the input signals through which the user expresses their business intent. In this embodiment, scenario instructions are obtained through the following two methods:
[0063] Firstly, it can be obtained through the BIOS setup interface. During server startup, users enter the BIOS setup interface (Setup Utility) and see a list of candidate scenarios presented by the system in the "One-Click Tuning" function menu. This list is filtered by S102 and only includes templates compatible with the current CPU platform. Users select a scenario using the keyboard or mouse, and the BIOS recognizes this selection as a scenario command.
[0064] Secondly, it can be obtained through the BMC Web management interface. For remote management scenarios where maintenance personnel cannot physically access the server, users can log in to the BMC Web management console (WebUI) through a browser and invoke the one-click tuning function in the management interface. The BMC WebUI transmits the scenario identifier selected by the user to the BIOS through the Redfish event service or IPMI OEM commands, and the BIOS determines the scenario command accordingly.
[0065] It should be noted that the scenarios referred to in the scenario instructions correspond to the parameter configuration schemes defined in the scenario templates of the tuning template library. As described above, typical scenario definitions include high-performance scenarios, energy-saving scenarios, and virtualization scenarios. The high-performance scenario aims to maximize computing performance by setting the CPU's P-State to P0, enabling Turbo Mode (allowing short-term overclocking within acceptable heat dissipation and power consumption limits), setting the Determinism Slider to Performance mode (maintaining a high CPU frequency and reducing fluctuations), and manually increasing cTDP to the upper limit allowed by the hardware specifications. This scenario is suitable for workloads that are sensitive to computational latency and intolerant of performance fluctuations, such as high-frequency trading and scientific computing.
[0066] The energy-saving scenario aims to minimize power consumption. The P-State control mode is set to Auto (the operating system or hardware automatically selects the optimal energy-efficient state), global C-State control is enabled (allowing the CPU to enter deep sleep mode when idle to save power), and cTDP is set to Auto, allowing the hardware to automatically adjust the power consumption limit based on the load. This scenario is suitable for non-critical businesses with large load fluctuations and a focus on energy efficiency.
[0067] The virtualization scenario aims to optimize the virtualization environment by enabling IOMMU (Input / Output Memory Management Unit, which provides DMA address translation and memory protection for virtual machines) and SR-IOV (Single Root Input / Output Virtualization, which allows a single physical PCIe device to be directly shared and accessed by multiple virtual machines), enabling SVM Control (Secure Virtual Machine Control, i.e., hardware virtualization support for AMD platforms), and setting Determinism Slider to Performance mode to ensure the frequency stability of the load within the virtual machine.
[0068] It should be noted that the parameter values in the above scenario definition have been determined through offline benchmark testing during the template generation stage of S101. Users only need to select the scenario name and do not need to manually configure the underlying register parameters.
[0069] When a user issues a scenario command through the above methods, the BIOS searches for a scenario template that matches the scenario command in the candidate tuning template set. The matching method is to compare the scenario identifier in the scenario command with the template identifiers of each scenario template in the candidate set. If they match, the template is selected as the target tuning template.
[0070] If the scenario identifier in the scenario instruction does not exactly match the template identifier of any scenario template in the candidate set, a scenario similarity calculation is performed. The calculation proceeds as follows: extract the scenario type code from the template identifier of each scenario template, calculate the matching degree between the scenario type in the scenario instruction and the scenario type codes of all scenario templates in the candidate set, and determine the scenario template with the highest matching degree as the most similar template. The scenario type code carries feature information about the optimization target of the scenario template. For example, the scenario type code of a high-performance scenario contains parameter combination feature identifiers aimed at performance optimization, the scenario type code of an energy-saving scenario contains parameter combination feature identifiers aimed at power consumption optimization, and the scenario type code of a virtualization scenario contains feature identifiers with IOMMU and SR-IOV enabled as prerequisites. Through feature extraction and matching degree calculation on the scenario type code, the system automatically recommends the most similar scenario template as the target tuning template when an exact match cannot be found. If the user accepts the recommendation, the template is directly used as the target tuning template; if the user does not accept it, they can manually select another template or create a custom template based on the most similar template.
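The description does not fix a formula for the matching degree. The sketch below assumes, purely for illustration, that a scenario type code is a bit field of feature identifiers and that the matching degree is the Jaccard similarity of the set bits; the flag names and the `recommend` function are hypothetical.

```python
# Hypothetical feature identifiers carried in a scenario type code.
PERF, POWER_SAVE, IOMMU_ON, SRIOV_ON = 0x1, 0x2, 0x4, 0x8


def matching_degree(requested, candidate):
    """Jaccard similarity between the feature bits of two type codes."""
    union = requested | candidate
    if union == 0:
        return 1.0  # two empty codes are treated as identical
    shared = requested & candidate
    return bin(shared).count("1") / bin(union).count("1")


def recommend(requested_code, templates):
    """Return (template_id, degree) for the most similar template, or None.

    `templates` maps a template identifier to its scenario type code.
    """
    if not templates:
        return None
    best = max(templates,
               key=lambda tid: matching_degree(requested_code, templates[tid]))
    return best, matching_degree(requested_code, templates[best])
```

Given templates `{"high_perf": PERF, "energy": POWER_SAVE, "virt": PERF | IOMMU_ON | SRIOV_ON}` and a request for `IOMMU_ON | SRIOV_ON`, this sketch recommends `"virt"` with a matching degree of 2/3; the user would then accept or reject that recommendation.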
[0071] When users manually select other templates or create custom templates based on the most similar template, they can obtain the scenario type code and description information of each scenario template from the recommendation page, and determine which scenario template's optimization goal best matches their current needs based on their own business requirements. In addition to selecting from preset scenarios, this embodiment also supports users creating custom templates based on scenario templates. Users can select a scenario template in the BIOS settings interface, modify non-critical parameters, and generate a new custom template using the "Save As Custom Template" function. Before saving the custom template, the BIOS performs configuration conflict checks on the user-modified parameter combinations. If the check fails, the user is prompted with the specific conflict items and saving is prevented; if the check passes, a CRC32 checksum is calculated, and the CRC32 checksum and the template data are written to NVRAM. Thus, the custom template appears in the candidate scenario list, and users can call it using scenario commands just like selecting preset scenarios.
[0072] Optionally, the BIOS settings interface includes brief descriptions for each scenario option, such as "High-performance scenario: suitable for compute-intensive services with higher power consumption" and "Energy-saving scenario: suitable for non-critical services with large load fluctuations, prioritizing energy efficiency," to help users quickly understand the applicable conditions for each scenario. Simultaneously, template selection operations can be recorded in the BMC SEL log, including operation time, user identity (identified by IPMI session ID), selected scenario name, and template version, facilitating operation and maintenance auditing.
[0073] This design lowers the barrier to entry. Traditional BIOS tuning requires administrators to manually configure dozens of register parameters one by one. These parameters are interconnected, making troubleshooting configuration errors costly. This method encapsulates parameter combinations into scenario templates, allowing users to complete tuning simply by selecting a business scenario without needing to understand the underlying register meanings. Furthermore, preset scenarios cover typical business needs, while custom templates meet individual requirements. Custom templates undergo parameter pre-verification before saving to ensure that even after user modifications, the final parameter combinations still meet security constraints. It's also worth noting that the candidate scenario list and scenario selection logic remain consistent whether using the local BIOS interface or the BMC remote management interface, ensuring a unified operational experience for both local and remote maintenance.
[0074] S104. Based on the real-time hardware status of the server to be tuned, perform conflict detection on the target tuning template, and determine the optimal tuning template based on the conflict detection results.
[0075] It's important to note that after determining the target tuning template, its parameters cannot be directly sent to the CPU registers. This is because the parameter values defined in the template are ideal values determined in an offline benchmark testing environment, while the current hardware operating state of the server to be tuned may differ significantly from that during offline testing. For example, if the server's CPU temperature is already high due to poor heat dissipation, sending a high-performance template that manually increases cTDP may trigger overheat protection or even hardware damage. Therefore, a safety check based on the real-time hardware state, i.e., conflict detection, is required before sending parameters. Only templates that pass the detection can become the final optimal tuning template.
[0076] Specifically, the conflict detection of the target tuning template based on the real-time hardware status of the server to be tuned includes:
[0077] (1) Obtain the real-time CPU operating status of the server to be tuned from the BMC through out-of-band management communication protocol.
[0078] It should be noted that the acquisition of real-time hardware operating status relies on the BMC, a management controller that operates independently of the server's main processor. The BMC monitors and manages the server hardware through out-of-band management channels, and it continues to function even when the main processor is powered off. In this embodiment, the BIOS interacts with the BMC via an out-of-band management communication protocol. An out-of-band management communication protocol refers to a management communication protocol that does not rely on the main operating system's network stack and is independent of the main data channel. Typical implementations include the communication protocol defined by the Intelligent Platform Management Interface (IPMI) specification and the Redfish protocol. IPMI is a mature hardware management interface standard that defines the request / response message format between the BMC and system software; Redfish is a modern management standard developed by the Distributed Management Task Force (DMTF) based on a RESTful API, using HTTPS and JSON for data transmission. The advantage of using an out-of-band management communication protocol is that the BIOS can directly obtain low-level hardware sensor data from the BMC without going through an operating system agent, resulting in a shorter data path, lower response latency, and independence from the operating system's state.
[0079] The real-time operating status obtained by the BIOS from the BMC through the aforementioned protocol includes CPU temperature, CPU power consumption, CPU load, and CPU health status. The CPU temperature is obtained as follows: the BMC reads the temperature value directly from the CPU's Digital Thermal Sensor (DTS) via the Platform Environment Control Interface (PECI) bus. PECI is a single-wire serial interface defined by Intel that allows the BMC to directly access the temperature monitoring circuitry within the CPU package, resulting in high data accuracy and fast response. The DTS output is the difference between the current temperature and the maximum junction temperature (Tjmax), expressed in degrees Celsius (°C); the smaller this difference, the closer the CPU is to Tjmax, that is, the higher its temperature.
[0080] The CPU power consumption is obtained by the BMC reading the current and voltage data from the Voltage Regulator Module (VRM) on the motherboard to calculate the CPU's real-time power consumption. The VRM is responsible for converting the voltage provided by the Power Supply Unit (PSU) into the core voltage required by the CPU, and its output current multiplied by the voltage is the CPU power consumption.
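The two conversions described above reduce to simple arithmetic, sketched here. The function names are illustrative, and the DTS reading is assumed to be supplied as a non-negative margin below Tjmax in degrees Celsius; real sensor encodings may differ.

```python
def package_temperature_c(tjmax_c, dts_margin_c):
    """Convert a DTS margin below Tjmax into an absolute temperature."""
    if dts_margin_c < 0:
        raise ValueError("margin below Tjmax is expected to be non-negative")
    return tjmax_c - dts_margin_c


def cpu_power_w(vrm_voltage_v, vrm_current_a):
    """CPU power as VRM output voltage multiplied by output current."""
    return vrm_voltage_v * vrm_current_a
```

For example, with an assumed Tjmax of 100 °C and a margin of 15 °C the package temperature is 85 °C, and a VRM output of 1.0 V at 150 A corresponds to 150 W.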
[0081] The CPU load is obtained by the BMC via the PECI bus or IPMI interface, which acquires CPU utilization data, including overall utilization and the independent utilization of each core. The load data reflects the current computational workload of the CPU.
[0082] Furthermore, CPU health status refers to a set of operational status indicators that can be precisely measured down to the individual physical core level, including whether the core is online, core temperature, core frequency, core voltage, and whether the core has generated correctable errors. With CPU health status monitoring at the individual physical core level, when a core malfunctions, the system can accurately locate the problematic core and take targeted action without having to degrade the entire CPU package.
[0083] Optionally, the BIOS also includes exception handling logic when acquiring the real-time operating status of the hardware. When the BIOS sends an IPMI request to the BMC, if no response is received within a preset timeout threshold (e.g., 3 seconds), the request is retried. If no response is received after three consecutive retries, the BMC communication is deemed abnormal, the one-click tuning process is exited, a "BIOS tuning communication lost" event is recorded in the BMC SEL log, and the current operating parameters are reverted to the static template preset values. When the read sensor value is invalid (some sensors return specific invalid values, such as 0xFF, when malfunctioning), the previously read valid value is used, and the data stale flag is set. If invalid values are read multiple times consecutively (e.g., five times), a degradation protection strategy is triggered, forcibly limiting the maximum P-State value to Pn-2 (i.e., two levels higher than the lowest power consumption state), reducing the risk of misjudgment during tuning due to sensor malfunction.
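The exception handling described above can be sketched as follows. The callback `send_request`, the exception class, and the `SensorFilter` helper are hypothetical names; the 3-second timeout, three retries, 0xFF invalid marker, and five-reading limit are the example values given in the text. Whether the degradation flag clears after a later valid reading is not stated in the source, so this sketch simply leaves it set.

```python
class BmcCommunicationError(Exception):
    """Raised when the BMC does not answer after all retries."""


def request_with_retry(send_request, timeout_s=3.0, max_retries=3):
    """Call send_request(timeout_s); retry on TimeoutError up to max_retries times."""
    for _ in range(1 + max_retries):  # one initial attempt plus the retries
        try:
            return send_request(timeout_s)
        except TimeoutError:
            continue
    raise BmcCommunicationError("BIOS tuning communication lost")


INVALID_READING = 0xFF  # example invalid value returned by a faulty sensor


class SensorFilter:
    """Holds the last valid reading and tracks consecutive invalid ones."""

    def __init__(self, max_invalid=5):
        self.max_invalid = max_invalid
        self.last_valid = None
        self.stale = False       # data stale flag
        self.degraded = False    # caller should cap the P-State at Pn-2
        self._invalid_streak = 0

    def update(self, raw):
        if raw == INVALID_READING:
            self._invalid_streak += 1
            self.stale = True
            if self._invalid_streak >= self.max_invalid:
                self.degraded = True
        else:
            self.last_valid = raw
            self.stale = False
            self._invalid_streak = 0
        return self.last_valid
```

On `BmcCommunicationError` the caller would exit the tuning process and log the event; when `degraded` becomes true the caller would apply the P-State limit.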
[0084] (2) Call the preset parameter conflict rule matrix to match the BIOS parameter combination in the target tuning template with the real-time running status of the CPU to determine whether there is a configuration conflict.
[0085] After acquiring the CPU's real-time operating status, the BIOS calls a preset parameter conflict rule matrix to perform conflict detection. The parameter conflict rule matrix is a set of rules built into the BIOS firmware, used to define which BIOS parameter combinations are prohibited or require warnings under specific hardware states. Each rule in the matrix contains four fields: Condition A, Condition B, conflict determination logic, and conflict type. By jointly determining Condition A and Condition B, it identifies whether a security risk exists.
[0086] Unlike simple threshold comparisons, the core value of the parameter conflict rule matrix lies in its ability to detect multi-dimensional cross-conflicts. It not only determines whether a single parameter exceeds the limit, but also discovers potential logical contradictions between parameters and between parameters and hardware states.
[0087] The classification of rules in the parameter conflict rule matrix is based on the objects affected by the conflict results, thus falling into three categories: fatal conflicts affecting hardware security, functional mutual exclusion conflicts affecting the integrity of software functions, and logical contradictions affecting the rationality of configuration logic. These three types of conflict rules are arranged in descending order of the severity of the affected objects. Fatal conflicts directly affect hardware security and are the highest priority blocking rules; functional mutual exclusion conflicts cause software malfunctions and are mandatory blocking rules; logical contradictions affect the clarity of configuration intent and are warning rules.
[0088] At the data structure level, each conflict rule includes four fields: Parameter Condition A, Parameter Condition B, Condition Logical Relationship, and Hit Result Type. Parameter Condition A and Parameter Condition B define two configuration conditions to be compared. Each configuration condition consists of a parameter name and a desired parameter value, which can be a specific numerical value, a Boolean state, or a range expression. The Condition Logical Relationship defines the decision relationship between Parameter Condition A and Parameter Condition B, including the three logical operators AND, OR, and XOR. The Hit Result Type identifies the response level to be taken after the rule is triggered, with response levels including fatal block, forced block, and warning.
[0089] The conflict detection process consists of four stages. The first stage is parameter condition extraction. The current values of each parameter in the target tuning template are extracted according to their names to form a list of parameters to be verified. The real-time CPU operating status data obtained from the BMC is also added to this list in the form of parameter names and corresponding values. The second stage is rule traversal. Each rule in the parameter conflict rule matrix is traversed, and parameter condition A and parameter condition B of the rule are each evaluated against the list of parameters to be verified. A condition is satisfied when the value of the corresponding parameter name in the list meets the expected value defined in the rule. The third stage is conflict determination. The two evaluation results are combined according to the rule's condition logical relationship: if the logical relationship is AND, a conflict is triggered when both conditions are satisfied; if the logical relationship is OR, a conflict is triggered when at least one condition is satisfied; if the logical relationship is XOR, a conflict is triggered when exactly one of the two conditions is satisfied. The fourth stage is result output. The rules that trigger conflicts are collected and output, sorted according to the severity of the hit result type.
[0090] When a conflict rule is triggered, the parameter names corresponding to parameter conditions A and B in that rule are the specific conflict items. When generating the prompt message, the BIOS extracts the parameter names involved in the rule along with the conflict type and conflict description information as the source of the specific information for the conflict item. When writing back to the BIOS settings interface, the BIOS locates the setting item corresponding to the parameter name in the conflict item and highlights it. When multiple conflict items exist simultaneously, the BIOS presents them in the order of priority: fatal block, forced block, and warning prompt. Conflicts of the same level are presented according to the traversal order of the rules in the matrix. Each conflict item carries a corresponding error code, parameter name, current value, and suggested correction direction.
[0091] In this embodiment, the parameter conflict rule matrix defines three different types of conflict determination logic:
[0092] The first type of conflict, the fatal conflict, is defined as a severe conflict between the real-time operating state of the hardware and specific BIOS parameters that poses a risk of hardware damage. For example, when the CPU's current package temperature exceeds the high-temperature threshold, it is strictly forbidden to set the cTDP manual control bit to Enabled. The high-temperature threshold is set differently depending on the CPU platform: 85°C for Intel platforms and 80°C for AMD platforms (the maximum junction temperature, Tjmax, on AMD platforms is usually lower than that on Intel platforms). Manually enabling cTDP allows the power consumption limit to be increased. In high-temperature environments, increasing power consumption is equivalent to increasing heat generation, which may trigger overheat protection, leading to unexpected system crashes or permanent damage to the CPU. Once a fatal conflict is triggered, the system must prevent the parameter from being issued, and users are not allowed to bypass this block.
[0093] The second type of conflict is the functional mutual exclusion conflict, defined as contradictory configurations between BIOS parameters due to functional dependencies. For example, if SVM Control (Secure Virtual Machine Control) is enabled while IOMMU (Input / Output Memory Management Unit) is disabled, a functional mutual exclusion occurs. This is because the virtualization assistance function SVM relies on the IOMMU for DMA address translation and memory isolation; disabling the IOMMU disables some of SVM's security isolation functions. Similarly, enabling SR-IOV support requires the IOMMU to be enabled, as SR-IOV relies on the IOMMU to provide independent address spaces for different virtual machines. When a functional mutual exclusion conflict is triggered, the system prevents parameter issuance and prompts the user to correct the configuration.
[0094] The third type of conflict, the logical contradiction, is defined as a warning-type conflict in which BIOS parameter values do not conform to the preset logical relationship. For example, a logical contradiction arises when P-State is set to P0 while Turbo Mode is set to Disabled. P0 is the maximum all-core frequency state, while Turbo Mode allows short-term overclocking of a single core or a few cores. Although the two settings control different things, a normally configured server would not pin the cores to the maximum frequency while simultaneously disabling Turbo Mode. While this type of conflict does not directly cause hardware damage, it may reflect an unclear configuration intent, and the system will issue a warning.
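The rule structure and the four detection stages can be sketched in Python as below. The class names, parameter names, error codes, and the use of predicates for expected values are assumptions for illustration; the three sample rules mirror the examples given for the three conflict types, using the 85°C threshold stated for Intel platforms.

```python
from dataclasses import dataclass
from typing import Callable

FATAL, FORCED, WARNING = 0, 1, 2  # lower number means higher severity


@dataclass(frozen=True)
class Condition:
    name: str
    predicate: Callable[[object], bool]

    def holds(self, values):
        return self.name in values and self.predicate(values[self.name])


@dataclass(frozen=True)
class Rule:
    code: str
    cond_a: Condition
    cond_b: Condition
    logic: str       # "AND", "OR", or "XOR"
    severity: int


LOGIC = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "XOR": lambda a, b: a != b,
}


def detect_conflicts(template_params, hw_status, rules):
    values = {**template_params, **hw_status}       # stage 1: extraction
    hits = []
    for index, rule in enumerate(rules):            # stage 2: traversal
        a = rule.cond_a.holds(values)
        b = rule.cond_b.holds(values)
        if LOGIC[rule.logic](a, b):                 # stage 3: determination
            hits.append((rule.severity, index, rule))
    hits.sort(key=lambda h: (h[0], h[1]))           # stage 4: sorted output
    return [rule for _, _, rule in hits]


RULES = [
    Rule("W001", Condition("p_state", lambda v: v == "P0"),
         Condition("turbo_mode", lambda v: v == "Disabled"), "AND", WARNING),
    Rule("F001", Condition("package_temp_c", lambda v: v > 85),
         Condition("ctdp_manual", lambda v: v == "Enabled"), "AND", FATAL),
    Rule("M001", Condition("svm_control", lambda v: v == "Enabled"),
         Condition("iommu", lambda v: v == "Disabled"), "AND", FORCED),
]
```

Hits are ordered by severity first and by position in the matrix second, so a fatal conflict is always reported ahead of a warning regardless of rule order.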
[0095] Optionally, the parameter conflict rule matrix supports dynamic expansion. When a new CPU platform introduces new configurable parameters, or when new high-risk parameter combinations are discovered in operational practice, new rules can be added to the matrix via BIOS firmware upgrades without modifying the main conflict detection logic. The rule matrix is stored in a structured format in the BIOS firmware, facilitating upgrades via firmware update files.
[0096] (3) When there is a configuration conflict, the parameter distribution process is blocked, a prompt message containing the specific information of the conflict item is generated, and the conflict item is written back to the BIOS settings interface.
[0097] It should be noted that after the conflict detection is completed, the BIOS will process the conflict accordingly based on the detection results. If there are no configuration conflicts, the target tuning template will be determined to have passed the security verification, and the template will be identified as the optimal tuning template and the distribution process will begin.
[0098] If a configuration conflict exists, regardless of whether it's a fatal conflict, a functional mutual exclusion conflict, or a logical contradiction, the BIOS will block the parameter delivery process. This means preventing parameters from the current template from being written to the CPU registers, thus eliminating the possibility of high-risk configurations taking effect at the source. Simultaneously, the BIOS generates a message containing detailed information about the conflicting item and writes it back to the BIOS setup interface. The message includes a clear error code, the name of the parameter involved in the conflict, its current value, and a suggested correction direction. After being written back to the BIOS setup interface, the conflict-related parameter items are highlighted, guiding the user to manually correct the unreasonable configuration items. After correction, the user can re-trigger the one-click tuning process.
[0099] This blocking and write-back approach prevents dangerous operations while providing users with precise error location and correction guidance. Users do not need to troubleshoot which parameter caused the verification failure, reducing troubleshooting costs and operational barriers.
[0100] It should be noted that conflict detection is initiated during the template loading phase. As described in S103, conflict detection is also triggered when the user saves a custom template, to ensure that the template stored in NVRAM does not contain conflicting configurations. The verification in S104, performed before the parameters are issued, is based on the real-time hardware status, because even if a template passed detection when it was created, the real-time status of the server may have changed since then.
[0101] Based on the preceding description, setting up verification based on real-time hardware status before writing parameters to the CPU register provides a more secure design. It also offers precise location and correction suggestions when user configurations pose risks, compensating for users' lack of understanding of underlying parameter dependencies. Furthermore, the parameter conflict rule matrix in this method adopts a structured design, supporting the addition of new rules through firmware upgrades, allowing security verification capabilities to continuously improve with hardware platform evolution and accumulated operational experience.
[0102] S105. Send the optimal tuning template parameters to the server to be tuned to complete the BIOS parameter configuration adjustment.
[0103] It should be noted that after conflict detection and determining the optimal tuning template, the BIOS parameter combination in the template is actually written into the CPU register to make the parameter configuration effective. However, the one-click tuning does not end after the parameters are issued; it also includes dynamic adjustments based on real-time load and monitoring and tiered response to hardware anomalies.
[0104] It should be noted that the first step is to convert the BIOS parameter combinations in the optimal tuning template into CPU-executable register instructions and write them into the corresponding CPU registers. This parameter delivery involves two types of writing methods: cold write and hot write. Cold write refers to parameter modifications that require a system restart to take effect, including function switches involving CPU core topology or virtualization capabilities, such as Simultaneous Multithreading (SMT) and Virtualization Technology Extensions (VT-x). For such parameters, after the BIOS performs a security check and passes, it does not immediately write the parameter value to the CPU registers. Instead, it records the parameter value in the BIOS's setting variable storage area, waiting for the BIOS to complete the register configuration during the next power-on self-test (POST).
[0105] Hot write refers to parameter modifications that take effect at runtime, including frequency and power consumption-related registers such as P-State limits and cTDP limits. For hot write parameters, the BIOS employs an atomicity guarantee mechanism of batch write plus readback verification: First, CPU interrupts are disabled to ensure the write process is not interrupted by interrupt handlers; then, parameter values are written to the target registers sequentially; after writing, the values of the first few written registers are read back and compared with the expected values; if the readback value matches the expected value, the write is confirmed successful, and interrupts are re-enabled; if the readback value does not match the expected value, the write is considered a failure, the register shadow cache value before modification is restored, and a rollback strategy is executed. This batch write combined with readback verification mechanism ensures that even if an exception occurs during the write process, the CPU registers will not be in an inconsistent state where some parameters are updated and others are not.
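The batch write plus readback verification described above can be illustrated with a minimal Python sketch. It is not the firmware implementation: the register file is modeled as a dictionary, the names `hot_write`, `good_write`, and `flaky_write` are hypothetical, interrupt masking is only indicated by comments, and for simplicity every written register is read back.

```python
from typing import Callable, Dict


class WriteFailed(Exception):
    """Raised when a readback value differs from the value that was written."""


def hot_write(registers: Dict[int, int],
              updates: Dict[int, int],
              write: Callable[[Dict[int, int], int, int], None]) -> bool:
    """Apply all updates or none; return True on success, False after rollback."""
    # Shadow cache: remember the pre-modification value of every target register.
    shadow = {addr: registers[addr] for addr in updates}
    # (In firmware, CPU interrupts would be disabled at this point.)
    try:
        # Batch write: write each parameter value to its target register in order.
        for addr, value in updates.items():
            write(registers, addr, value)
        # Readback verification: compare what the register holds with what was expected.
        for addr, value in updates.items():
            if registers[addr] != value:
                raise WriteFailed(hex(addr))
        return True
    except WriteFailed:
        # Rollback: restore the shadow values so no partial update remains.
        for addr, old in shadow.items():
            registers[addr] = old
        return False
    # (In firmware, interrupts would be re-enabled on both paths.)


def good_write(regs: Dict[int, int], addr: int, value: int) -> None:
    """Simulated register write that always succeeds."""
    regs[addr] = value


def flaky_write(regs: Dict[int, int], addr: int, value: int) -> None:
    """Simulated fault: the write to address 0x2 is silently dropped."""
    if addr != 0x2:
        regs[addr] = value
```

With `flaky_write`, the update to address 0x1 lands but the one to 0x2 does not, so the readback check fails and both registers return to their shadow values, which is the all-or-nothing behavior the paragraph describes.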
[0106] Optionally, if the system becomes unresponsive or the watchdog times out after writing, the BMC detects a BIOS power-on self-test timeout and triggers a CMOS clear signal through a general-purpose input / output pin to force the loading of the previously known good configuration, ensuring that the system can be restored to a safe state at the hardware level.
[0107] After the parameters are issued, the BIOS initiates a dynamic adjustment mechanism. The purpose of this dynamic adjustment is to continuously optimize the issued BIOS parameters based on real-time changes in load after the server is actually running a business load, achieving a real-time balance between performance and power consumption. It's important to note that the parameters in the one-click tuning template are based on static optimal values from offline benchmark tests. However, the load level changes in real time after the server is running a business load. High loads require higher performance to avoid service latency, while low loads can appropriately reduce performance to save power. Maintaining the preset static parameters indefinitely would not achieve a dynamic balance between performance and power consumption.
[0108] Specifically, after sending the optimal tuning template parameters to the server to be tuned, the process includes:
[0109] (1) The BIOS continuously obtains the real-time load data of the CPU from the BMC.
[0110] A periodic data acquisition channel is established between the BIOS and the BMC. The BIOS obtains multi-dimensional load indicators such as real-time CPU utilization, memory bandwidth usage, and I / O throughput through the BMC. The data acquisition frequency is adaptively adjusted according to the magnitude of load changes. For example, a higher acquisition frequency is maintained when the load fluctuates sharply to ensure data timeliness, while the acquisition frequency is reduced when the load is stable to reduce BMC communication overhead.
[0111] Optionally, during dynamic adjustment, the BIOS periodically writes an incrementing heartbeat counter value to a specific address in the BMC. If the BMC does not detect a change in the heartbeat counter value within a preset time, it determines that the BIOS has abnormally hung, and the BMC takes over and forcibly pulls the PROCHOT# pin low. PROCHOT# is the CPU's overheat protection signal pin; pulling this pin low automatically reduces the CPU's frequency and voltage to prevent overheating or uncontrolled power consumption due to lack of management during BIOS malfunctions.
[0112] (2) Based on the CPU real-time load data and the preset dynamic adjustment strategy, the issued BIOS parameters are dynamically adjusted.
[0113] After acquiring real-time load data, the dynamic adjustment strategy automatically determines which parameters to adjust and by how much. First, it calculates the performance change rate, i.e., the rate at which CPU utilization increases or decreases per unit time, to determine the trend and urgency of load changes. A performance change rate above a preset threshold indicates that the load is rapidly increasing or decreasing and that the system needs to intervene in advance. Second, it calculates the power adjustment margin from the performance change rate: the faster the performance changes, the larger the margin; the slower it changes, the smaller the margin. The power adjustment margin defines the maximum allowable adjustment range of cTDP in the current cycle. Finally, P-State and cTDP are optimized jointly as two linked adjustment parameters, with the goal of minimum energy consumption and maximum energy efficiency. With the power adjustment margin as the constraint, the target P-State value is determined within the P-State adjustment range, and the target cTDP value is determined within the cTDP range of PL1 to PL2. This ensures that the adjusted parameter combination meets the load performance requirements while minimizing total power consumption and maximizing energy efficiency.
[0114] This automatically determined dynamic adjustment mechanism upgrades the previously fixed rule of increasing parameters under high load and decreasing parameters under low load to an adaptive adjustment that senses load trends in real time based on the rate of performance change, automatically calculates adjustment margins and target values. The introduction of the rate of performance change allows the system to predict load trends and intervene in advance, while the introduction of power consumption adjustment margins prevents over-adjustment or oscillation caused by excessively large single adjustments. The coordinated optimization of P-State and cTDP ensures that the adjustment results achieve the lowest energy consumption and highest efficiency while meeting performance requirements.
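As an illustration of how a performance change rate could bound the per-cycle cTDP change, the following Python sketch may help. The description only states that a faster change rate yields a larger margin; the linear mapping, the gain `gain_w_per_rate`, the cap `max_margin_w`, and all function names are assumptions introduced for this example.

```python
def performance_change_rate(prev_util: float, cur_util: float, dt_s: float) -> float:
    """Change in CPU utilization (percentage points) per second."""
    return (cur_util - prev_util) / dt_s


def power_adjust_margin(rate: float, gain_w_per_rate: float, max_margin_w: float) -> float:
    """Assumed linear mapping: a faster change rate gives a larger margin, capped."""
    return min(max_margin_w, gain_w_per_rate * abs(rate))


def clamp_ctdp_step(requested_w: float, margin_w: float) -> float:
    """Limit a requested cTDP change (in watts) to the margin of the current cycle."""
    return max(-margin_w, min(margin_w, requested_w))
```

For example, a rise from 40% to 60% utilization over 2 s gives a rate of 10 points per second; with an assumed gain of 0.5 W per unit rate the margin is 5 W, so a requested 8 W increase would be limited to 5 W in that cycle.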
[0115] After acquiring real-time load data, the BIOS determines whether to adjust the issued parameters based on a preset dynamic adjustment strategy. The dynamic adjustment strategy addresses how to make the system sufficiently sensitive to load changes without overreacting to momentary load fluctuations. If adjustments are made immediately as soon as the load exceeds a certain threshold, frequent parameter adjustments due to short-term load spikes may cause performance oscillations; conversely, a slow response may fail to address real-time load changes. This embodiment employs a dual-threshold hysteresis debouncing mechanism.
[0116] Specifically, the step of dynamically adjusting the issued BIOS parameters based on the real-time CPU load data and a preset dynamic adjustment strategy includes:
[0117] (i) Calculate the moving average of CPU utilization.
[0118] Let the CPU utilization rate measured in the current sampling period be CurLoad, and the moving average value of the previous period be PrevAvgLoad. Then, the current moving average value AvgLoad = CurLoad × α + PrevAvgLoad × (1 - α), where α is the smoothing coefficient (0 < α < 1). The function of the moving average is to filter out instantaneous spikes and short-term fluctuations in load, so that the adjustment basis reflects the continuous trend of the load rather than instantaneous fluctuations.
[0119] (ii) When the moving average exceeds the preset rising threshold N consecutive times, and the time elapsed since the most recent adjustment has exceeded the preset cooling interval, a performance improvement adjustment is triggered, where N is an integer greater than 1.
[0120] This embodiment employs a dual-threshold design. For example, the rising threshold is set to 85%, and the falling threshold is set to 60%. The reason for using dual thresholds instead of a single threshold is that if the rising and falling triggers use the same threshold (e.g., both 75%), when the load fluctuates around 75%, the system will repeatedly switch between increasing and decreasing, causing performance oscillations. The hysteresis design, with the rising threshold higher than the falling threshold, ensures that the frequency increase is triggered only when the load exceeds 85% from low to high, and the frequency decrease is triggered only when the load falls below 60% from high to low. Maintaining the current parameters unchanged within the intermediate range of 60% to 85% effectively avoids frequent switching near a single threshold.
[0121] Furthermore, even if the moving average exceeds the threshold, the system does not immediately execute adjustments. Instead, it requires multiple consecutive sampling periods to meet the triggering conditions before confirming the trigger. This multiple confirmation mechanism further filters out short-term load fluctuations, ensuring that adjustment decisions are based on a continuous and stable load change trend.
[0122] A forced cooling interval is set between two adjustment actions, starting from the most recent adjustment. The next adjustment is only allowed after the cooling interval has expired. The cooling interval is set to allow the system a stable observation period after adjustment. During the observation period, changes in load can fully reflect the actual effect of the previous adjustment, avoiding over-adjustment caused by continuous adjustments.
[0123] When a performance improvement adjustment is triggered, the BIOS adjusts parameters in a preset coordination order: P-State is adjusted first, and cTDP is adjusted only when P-State has reached its limit. P-State is prioritized because its response latency is low: the adjustment takes effect simply by modifying the CPU's performance state register, typically with latency in the microsecond range. In contrast, a cTDP adjustment involves reconfiguring the power limiter and therefore has a longer response latency. Using the faster-responding parameter first allows the system to adapt to load changes as quickly as possible.
[0124] The adjustment range of P-State is from P0 to Pn, where P0 is the highest performance state (highest all-core frequency) defined in the CPU hardware specification, and Pn is the lowest performance state (lowest frequency, maximum power saving). The adjustment range of cTDP is from PL1 to PL2, where PL1 is the long-term stable power consumption limit defined in the CPU hardware specification, and PL2 is the short-term maximum power consumption limit defined in the CPU hardware specification.
[0125] Specifically, the BIOS adjusts the CPU's P-State by one step level towards high performance, for example, from P2 to P1. If the current P-State has reached the maximum performance limit P0, but the load is still higher than the preset high load threshold (such as 90%), it means that simply increasing the frequency is no longer sufficient to meet the performance requirements. The system further increases the cTDP by a preset power consumption step (such as 5W) to release more performance by relaxing the power consumption limit.
[0126] (iii) When the moving average is below the preset falling threshold M consecutive times, and the time elapsed since the most recent adjustment has exceeded the preset cooling interval, a power consumption reduction adjustment is triggered, where M is an integer greater than 1.
[0127] It should be noted that the debounce confirmation and cooling interval mechanisms are the same as those for the performance improvement adjustment and will not be elaborated here.
[0128] When power consumption reduction adjustment is triggered, the BIOS adjusts the P-State by one step level towards lower power consumption, for example, from P1 to P2. The larger the P-State value, the lower the frequency and the lower the power consumption. If the P-State has reached the minimum power consumption limit Pn, and the load is still below the preset low load threshold, then the cTDP is lowered by a preset power consumption step to further reduce the power consumption limit and save energy.
[0129] The adjustment range of P-State is P0 to Pn, as mentioned above, and the adjustment range of cTDP is PL1 to PL2. The specific values of PL1 and PL2 are defined by the CPU hardware specifications and have been loaded as safety boundaries in the S101 root template. Dynamic adjustments will not exceed these hardware safety boundaries.
[0130] It should be noted that the lower limit of P-State adjustment does not exceed Pn, and the upper limit does not exceed P0; the upper limit of cTDP does not exceed PL2, where PL2 is the short-term maximum power consumption limit defined in the CPU hardware specification, and the lower limit of cTDP is not lower than PL1, where PL1 is the long-term stable power consumption limit defined in the CPU hardware specification. These upper and lower limit constraints are defined in the S101 root template to ensure that dynamic adjustments do not exceed the hardware safety boundary due to over-adjustment.
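The coordination order and the boundary constraints can be illustrated with a short Python sketch. The concrete numbers are hypothetical (a Pn index of 4, PL1 = 150 W, PL2 = 200 W, a 30% low-load threshold); the 5 W step and the 90% high-load threshold follow the examples in the description. The names `PowerState`, `Limits`, `raise_performance`, and `lower_power` are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    p_state: int    # 0 means P0 (highest performance); larger index, lower frequency
    ctdp_w: float   # current cTDP in watts


@dataclass(frozen=True)
class Limits:
    p0_index: int = 0          # upper performance bound (P0)
    pn_index: int = 4          # lower performance bound (Pn), hypothetical value
    pl1_w: float = 150.0       # long-term power limit, hypothetical value
    pl2_w: float = 200.0       # short-term power limit, hypothetical value
    ctdp_step_w: float = 5.0   # preset power consumption step


def raise_performance(s: PowerState, lim: Limits, load: float,
                      high_load: float = 90.0) -> PowerState:
    """P-State first; raise cTDP only at P0 and only if load is still high."""
    if s.p_state > lim.p0_index:
        return PowerState(s.p_state - 1, s.ctdp_w)
    if load > high_load:
        return PowerState(s.p_state, min(lim.pl2_w, s.ctdp_w + lim.ctdp_step_w))
    return s


def lower_power(s: PowerState, lim: Limits, load: float,
                low_load: float = 30.0) -> PowerState:
    """P-State first; lower cTDP only at Pn and only if load is still low."""
    if s.p_state < lim.pn_index:
        return PowerState(s.p_state + 1, s.ctdp_w)
    if load < low_load:
        return PowerState(s.p_state, max(lim.pl1_w, s.ctdp_w - lim.ctdp_step_w))
    return s
```

Because each cTDP change is clamped with `min` and `max`, a step that would cross PL2 or PL1 stops at the boundary, so the result never leaves the range loaded from the root template.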
[0131] Optionally, dynamic adjustments are only valid during the current session. After a power outage, the adjusted values are lost and revert to the default values preset in the template. This is because dynamic adjustments are temporary optimizations based on the actual load conditions during the current session, rather than permanent configuration changes. The load characteristics of the next session may be completely different, making it more reasonable to start tuning from the template default values again.
[0132] In addition to dynamic adjustments, the system also simultaneously activates an abnormal state monitoring mechanism after the parameters are issued. Dynamic adjustments address the balance between performance and power consumption, while abnormal state monitoring addresses security issues. Even if the tuning parameters have passed S104 conflict pre-checking, the server may still experience hardware anomalies not covered by the template parameters during actual operation. These anomalies include sudden temperature increases due to cooling system failures, power overload due to power supply anomalies, and a large number of correctable errors in the memory.
[0133] Specifically, after sending the optimal tuning template parameters to the server to be tuned, the process further includes:
[0134] (1) Continuously monitor the hardware operating status of the server.
[0135] Anomaly monitoring is performed collaboratively by the BMC and the BIOS. The BMC continuously reads the CPU temperature via the PECI bus, reads power consumption data from the VRM, and obtains memory correctable error counts by polling the MCA banks. The BIOS obtains the above monitoring data from the BMC via an out-of-band management communication protocol.
[0136] Abnormal events are detected based on the following trigger thresholds. For overheat protection, an overheating abnormality is determined when the PECI temperature DTS value read by the BMC is lower than the preset threshold. The DTS value represents the margin to the maximum junction temperature (Tjmax); the smaller the value, the closer the temperature is to the limit. For power consumption overload, an overload is determined when the power measured at the VRM stays above a preset proportion (e.g., 105%) of the rated power of the power supply unit (PSU) for a preset time (e.g., 500 ms). For a memory correctable error (CE) storm, a memory CE storm is determined when the correctable error count per minute polled from the MCA banks exceeds the preset threshold (e.g., 100 times).
[0137] (2) When a hardware abnormal event is detected, a graded response operation is performed according to the severity of the abnormal event, wherein the graded response operation includes first-level alarm processing, second-level degradation processing, third-level pause tuning processing and fourth-level emergency rollback processing.
[0138] It should be noted that the four response levels are connected by a two-way channel: progressive escalation and automatic recovery. A lower-level response escalates to a higher-level response if the anomaly persists, and a higher-level response automatically steps down once the anomaly has been resolved and the state has remained stable for a preset time.
[0139] The four levels are handled as follows. Level 1 alarm handling applies when a detected anomaly does not yet affect normal system operation but requires attention. The system only records a BMC SEL log entry, including the anomaly type, trigger time, and current sensor values, without adjusting parameters. Maintenance personnel can use the logs to trace the timeline and frequency of the anomaly. Level 2 degradation handling applies when an anomaly poses a potential risk but is not yet critical. The system forcibly limits P-State or cTDP to a preset safe range. For example, when the CPU temperature DTS value is below a preset threshold, the upper limit of P-State is forcibly lowered by two levels and cTDP is reduced to a preset value, mitigating overheating by reducing frequency and power consumption. The degraded state is automatically lifted after the anomaly condition is resolved. Level 3 tuning pause handling applies when the anomaly continues to worsen or frequently triggers Level 2 responses. The system freezes the current dynamic adjustment operations and locks the current parameters, prohibiting further adjustments. The purpose of the tuning pause is to prevent the dynamic adjustment mechanism from continuing to change parameters under abnormal conditions and complicating the problem before the root cause of the anomaly is located and fixed. Level 4 emergency rollback handling applies when a serious hardware anomaly is detected (such as temperature approaching the hardware limit or power consumption severely exceeding limits). In this case, the system triggers a system management interrupt, and the BIOS restores the CPU registers to a preset safe default configuration. The safe default configuration is a fully validated, minimum-risk parameter set that ensures the hardware itself will not suffer permanent damage even if services are interrupted.
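The two-way channel between response levels (escalation while an anomaly persists, step-down after a stable period) can be modeled as a small state machine. The Python sketch below is illustrative only: it counts consecutive samples instead of measuring the preset time, and the names `Level`, `TieredResponder`, `escalate_after`, and `recover_after` are assumptions introduced for this example.

```python
from enum import IntEnum


class Level(IntEnum):
    NORMAL = 0
    ALARM = 1                # Level 1: log only
    DEGRADE = 2              # Level 2: limit P-State / cTDP
    PAUSE_TUNING = 3         # Level 3: freeze dynamic adjustment
    EMERGENCY_ROLLBACK = 4   # Level 4: restore safe defaults


class TieredResponder:
    def __init__(self, escalate_after: int = 3, recover_after: int = 5):
        self.level = Level.NORMAL
        self.escalate_after = escalate_after  # anomalous samples before escalating
        self.recover_after = recover_after    # clean samples before stepping down
        self._bad = 0
        self._good = 0

    def observe(self, anomaly: bool, critical: bool = False) -> Level:
        """Feed one monitoring sample and return the resulting response level."""
        if critical:
            # A serious anomaly jumps straight to emergency rollback.
            self.level = Level.EMERGENCY_ROLLBACK
            self._bad = self._good = 0
        elif anomaly:
            self._good = 0
            if self.level == Level.NORMAL:
                self.level = Level.ALARM
                self._bad = 0
            else:
                self._bad += 1
                if (self._bad >= self.escalate_after
                        and self.level < Level.EMERGENCY_ROLLBACK):
                    # The anomaly persisted: escalate by one level.
                    self.level = Level(self.level + 1)
                    self._bad = 0
        else:
            self._bad = 0
            self._good += 1
            if self._good >= self.recover_after and self.level > Level.NORMAL:
                # Stable for long enough: step down by one level.
                self.level = Level(self.level - 1)
                self._good = 0
        return self.level
```

A real implementation would attach the corresponding action to each transition (writing the SEL entry, applying the P-State limit, freezing adjustment, triggering the rollback); the sketch tracks only the level itself.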
[0140] The method provided in this embodiment constructs a four-level hierarchical tuning template library in which platform identifiers are mapped to CPU models. Based on offline benchmark testing and multiple linear regression fitting, it generates parameter combinations adapted to different scenarios, ensuring the reliability and platform compatibility of template parameters at the source. Before the parameters are issued, the real-time hardware operating status, such as CPU temperature, power consumption, load, and health status, is obtained from the BMC via an out-of-band management communication protocol. A preset parameter conflict rule matrix is invoked to cross-validate the target template, which effectively detects and blocks high-risk configurations, such as enabling a high power limit at high temperature or enabling virtualization functions without enabling the IOMMU at the same time, eliminating security risks before the parameters take effect. When a conflict is found, the method blocks the issuance and writes the conflicting items back to the BIOS settings interface; this closed-loop processing provides users with precise correction guidance rather than a simple error exit. After the parameters are issued, a multi-layered debounced dynamic adjustment mechanism based on a moving average, a dual-threshold hysteresis interval, multiple consecutive confirmations, and a forced cooling interval achieves a real-time balance between performance and power consumption, with P-State adjusted first and cTDP as a supplement, while avoiding performance oscillations and taking into account both response speed and system stability. In addition, a four-level tiered anomaly response mechanism, from alarm recording and degradation limiting to tuning pause and emergency rollback, provides full-link closed-loop protection for the entire tuning process, from security verification through dynamic optimization to fault protection, significantly reducing the technical threshold and the risk of misoperation for maintenance personnel who would otherwise configure BIOS parameters manually.
[0141] Embodiment 2
[0142] Corresponding to the aforementioned embodiment of a one-click server BIOS tuning method, this application also provides an embodiment of a one-click server BIOS tuning device.
[0143] Figure 2 is a schematic diagram of Embodiment 2 of the server BIOS one-click tuning device provided in this application. Referring to Figure 2, the apparatus provided in this embodiment includes a creation module 210, a determination module 220, a detection module 230, and a processing module 240;
[0144] The creation module 210 is used to create a tuning template library, which includes multiple tuning templates. Each tuning template has a unique platform identifier, and the platform identifier and CPU model have a mapping relationship.
[0145] The determination module 220 is used to determine a set of candidate tuning templates in the platform layer of the tuning template library based on the CPU model of the server to be tuned and the mapping relationship;
[0146] The determination module 220 is further configured to determine the target tuning template from the candidate tuning template set based on the scenario instructions;
[0147] The detection module 230 is used to perform conflict detection on the target tuning template based on the real-time hardware status of the server to be tuned, and to determine the optimal tuning template based on the conflict detection results.
[0148] The processing module 240 is used to send the optimal tuning template parameters to the server to be tuned to complete the BIOS parameter configuration adjustment.
[0149] The apparatus of this embodiment can be used to perform the steps of the method embodiment shown in Figure 1. The principle and process are similar and will not be repeated here.
[0150] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0151] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this application according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0152] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A server BIOS one-key tuning method, characterized in that, The method includes: Create a tuning template library, which includes multiple tuning templates. Each tuning template has a unique platform identifier, and the platform identifier has a mapping relationship with the CPU model. Based on the CPU model of the server to be tuned and the mapping relationship, a set of candidate tuning templates is determined in the platform layer of the tuning template library; The target tuning template is determined from the candidate tuning template set based on the scenario instructions; Based on the real-time hardware status of the server to be tuned, conflict detection is performed on the target tuning template, and the optimal tuning template is determined based on the conflict detection results. The optimal tuning template parameters are sent to the server to be tuned to complete the BIOS parameter configuration adjustment.
2. The method of claim 1, wherein, The creation of the tuning template library includes: Construct a four-level hierarchical template structure, the four-level hierarchical template structure including: The root template defines the basic register operation permissions and security boundaries; The platform template inherits from the root template and populates the default register address table of the CPU platform. The scenario template inherits from the platform template and sets corresponding parameter values for different application scenarios. A custom template is used, which inherits from the scenario template and uses a copy-on-write strategy to store the user-modified difference parameters.
3. The method according to claim 1, characterized in that, The creation of the tuning template library includes: Determine the platform template that the scenario template inherits from, and define the scenario optimization objective; Run a standard workload on the corresponding CPU platform and record performance data under different parameter combinations; The weight coefficients of each parameter to the scenario optimization objective are calculated using a regression algorithm; The optimal set of parameters for a given scenario is determined by sorting and filtering the parameter combinations based on the weight coefficients.
4. The method according to claim 1, characterized in that, The conflict detection of the target tuning template based on the real-time hardware status of the server to be tuned includes: The real-time CPU operating status of the server to be tuned is obtained from the BMC via out-of-band management communication protocol; The preset parameter conflict rule matrix is invoked to match the BIOS parameter combination in the target tuning template with the real-time operating status of the CPU to determine whether there is a configuration conflict. When a configuration conflict exists, the parameter distribution process is blocked, a prompt message containing specific information about the conflicting item is generated, and the conflicting item is written back to the BIOS settings interface.
5. The method according to claim 1, characterized in that, After the optimal tuning template parameters are sent to the server to be tuned, the following steps are included: The BIOS continuously obtains real-time load data of the CPU from the BMC; Based on the CPU real-time load data and the preset dynamic adjustment strategy, the issued BIOS parameters are dynamically adjusted.
6. The method according to claim 5, characterized in that, The step of dynamically adjusting the issued BIOS parameters based on the CPU real-time load data and a preset dynamic adjustment strategy includes: Calculate the moving average of CPU utilization; When the moving average exceeds the preset rising threshold N consecutive times, and the time since the most recent adjustment has exceeded the preset cooling interval, a performance improvement adjustment is triggered, where N is an integer greater than 1. When the moving average is lower than the preset falling threshold M consecutive times, and the time since the most recent adjustment has exceeded the preset cooling interval, a power consumption reduction adjustment is triggered, where M is an integer greater than 1.
7. The method according to claim 6, characterized in that, The method includes: When a performance boost adjustment is triggered, the CPU's P-State performance is adjusted to a higher performance level by one step. When the P-State has reached the maximum performance limit and the load is still higher than the preset high load threshold, the CPU's configurable thermal design power (cTDP) is increased by a preset power consumption step. When power consumption reduction adjustment is triggered, P-State is adjusted one step size towards lower power consumption; when P-State has reached the minimum power consumption limit and the load is still lower than the preset low load threshold, cTDP is reduced by a preset power consumption step size.
8. The method according to claim 1, characterized in that, The real-time hardware status includes: CPU temperature, CPU power consumption, CPU load, and CPU health status.
9. The method according to claim 1, characterized in that, After sending the optimal tuning template parameters to the server to be tuned, the process further includes: Continuously monitor the hardware operating status of the server; When a hardware anomaly is detected, a tiered response operation is performed based on the severity of the anomaly. The tiered response operation includes a first-level alarm handling, a second-level degradation handling, a third-level pause tuning handling, and a fourth-level emergency rollback handling.
10. A one-click BIOS tuning device for servers, characterized in that, The device includes a creation module, a determination module, a detection module, and a processing module; The creation module is used to create a tuning template library, which includes multiple tuning templates. Each tuning template has a unique platform identifier, and the platform identifier and CPU model have a mapping relationship. The determination module is used to determine a set of candidate tuning templates in the platform layer of the tuning template library based on the CPU model of the server to be tuned and the mapping relationship; The determination module is further configured to determine the target tuning template from the candidate tuning template set based on the scenario instructions; The detection module is used to perform conflict detection on the target tuning template based on the real-time hardware status of the server to be tuned, and to determine the optimal tuning template based on the conflict detection results. The processing module is used to send the optimal tuning template parameters to the server to be tuned to complete the BIOS parameter configuration adjustment.