Server fault information recording method and apparatus, and computer device and storage medium
By employing an asynchronous processing strategy in the baseboard management controller to create asynchronous tasks for recording fault information, the server performance degradation and jitter issues caused by correctable errors are resolved, thereby improving server stability and response speed.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- INSPUR SUZHOU INTELLIGENT TECH CO LTD
- Filing Date
- 2025-06-10
- Publication Date
- 2026-06-25
AI Technical Summary
Frequent recording of correctable error (CE) fault information leads to decreased server performance, increased network jitter, and response latency, impacting server operations.
The Baseboard Management Controller (BMC) adopts an asynchronous processing strategy (AHCEE). By creating asynchronous tasks within the BMC, the time the CPU spends in system management mode is reduced, and the fault information recording process is optimized.
It effectively reduces the processing time for correctable errors, minimizes the impact on server performance, avoids network jitter and latency issues, and improves the server's business processing capabilities.
Description
Server fault information recording methods, devices, computer equipment, and storage media
[0001] Cross-references to related applications
[0002] This application claims priority to Chinese Patent Application No. 202411890480.3, filed on December 20, 2024, entitled “Method, Apparatus, Computer Equipment and Storage Medium for Recording Server Fault Information”, the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This application relates to the field of computer technology, and in particular to a method, apparatus, computer equipment, and storage medium for recording server fault information.
Background Technology
[0004] Errors in memory and PCIe (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) devices can be categorized into correctable errors (CE) and uncorrectable errors (UCE). Uncorrectable errors typically cause server downtime, requiring on-site maintenance personnel to replace the components. Correctable errors can be self-corrected by the server system, after which the server system can be used normally.
[0005] Frequent CEs can severely impact server system performance because, after a CE is corrected, the server system needs to record the CE's fault information. For example, the Baseboard Management Controller (BMC) records System Event Log (SEL) entries and black-box logs to the BMC flash memory. This process of recording CE fault information consumes CPU resources. Recording each piece of fault information requires the CPU to enter System Management Mode (SMM) to process the fault information sent from the Basic Input Output System (BIOS) to the BMC. Only after the BMC records the fault information and returns the processing result can the CPU exit SMM and continue processing normal business data.
[0006] If the BIOS reports a large number of CE fault messages, the CPU must enter SMM frequently to process them. This leads to decreased server performance, increased network jitter, and response latency, thus affecting server services.
[0007] Therefore, the related technologies have the problem that recording CE fault information degrades server performance, resulting in increased network jitter and response latency.
Summary of the Invention
[0008] In view of this, this application provides a method, apparatus, computer device, and storage medium for recording server fault information.
[0009] In a first aspect, this application provides a server fault information recording method, which is applied to a baseboard management controller and includes:
[0010] In response to determining that a preset number of fault information recording requests have been received from the basic input / output system, a target task corresponding to each fault information recording request is created, wherein the target task is used to record the fault information corresponding to the fault information recording requests; and
[0011] The system sends the corresponding processing result of the fault information recording request to the basic input / output system and executes the target task to complete the recording of fault information. The processing result is used to instruct the basic input / output system to control the central processing unit to exit the preset mode and continue processing business data.
[0012] In some implementations, creating the target task corresponding to the fault information recording request includes:
[0013] Perform data integrity verification on the fault information logging request; and
[0014] If the fault information logging request verification passes, then the target task is created.
[0015] In this embodiment, by performing data integrity verification on the fault information recording request, it can be verified whether the format of the fault information recording request is correct, whether the parameters are valid, and whether it contains the necessary permission information.
[0016] In some implementations, data integrity verification is performed on the fault information logging request, including:
[0017] Obtain the structure data corresponding to the fault information recording request, and determine whether the format of the structure data is the same as the preset format. The preset format is used to verify the permission information of the fault information recording request.
[0018] In response to determining that the format of the structure data is the same as the preset format, the parameter types and number of parameters of the structure data are obtained; and
[0019] If the parameter type is found to be the same as the preset type and the number of parameters is the same as the preset value, the fault information recording request verification passes.
[0020] In this embodiment, the fault information recording request is checked for data integrity using a preset format, preset type, and preset value to ensure that the fault information recording request is in the correct format, has valid parameters, contains necessary permission information, and has data integrity and parameter number integrity.
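The format, type, and count checks described above can be illustrated with a minimal sketch. The field names (`slot`, `address`, `error_type`), the preset types, and the preset count are hypothetical; the application does not specify the layout of the structure data.

```python
# Hypothetical presets: the application does not define the actual structure layout.
PRESET_TYPES = {"slot": int, "address": int, "error_type": str}
PRESET_PARAM_COUNT = 3


def verify_request(request) -> bool:
    """Return True only if format, parameter types, and parameter count match the presets."""
    # Format check: the request must be a mapping that carries every preset field.
    if not isinstance(request, dict) or not set(PRESET_TYPES) <= set(request):
        return False
    # Count check: no missing or extra parameters.
    if len(request) != PRESET_PARAM_COUNT:
        return False
    # Type check: every parameter must have the preset type.
    return all(isinstance(request[name], kind) for name, kind in PRESET_TYPES.items())
```

A request that fails any of the three checks would be rejected before a target task is created.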
[0021] In some implementations, creating the target task corresponding to the fault information recording request includes:
[0022] Based on the fault information recording request, create an asynchronous task inside the baseboard management controller and use the asynchronous task as the target task;
[0023] Add the target task to the task queue and retrieve the priority of the target task in the task queue; and
[0024] The execution order of target tasks in the task queue is determined based on priority.
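One way to picture a priority-ordered task queue is the sketch below. It assumes that a lower number means a higher priority and that tasks with equal priority run in arrival order; neither rule is stated in the application, and the class name `TaskQueue` is illustrative only.

```python
import heapq
import itertools


class TaskQueue:
    """Illustrative priority queue for target tasks (not the BMC's actual queue)."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal-priority tasks keep arrival order.
        self._counter = itertools.count()

    def add(self, task, priority: int) -> None:
        # Assumption: a smaller priority value is executed earlier.
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def execution_order(self) -> list:
        # Sorting a copy shows the order without consuming the queue.
        return [task for _, _, task in sorted(self._heap)]

    def pop(self):
        # Remove and return the next task to execute.
        return heapq.heappop(self._heap)[2]
```

For example, after adding `"pcie_ce"` with priority 2 and two memory tasks with priority 1, the memory tasks are executed first, in the order they were added.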
[0025] In some implementations, performing the target task includes:
[0026] Based on the task scheduler and execution order, determine the tasks to be executed in the task queue;
[0027] The system calls an information processing object to process the task to be executed, obtains the processed information, and calls a logging object to generate a fault information log based on the processed information; and
[0028] Fault information logs are saved to the flash memory of the baseboard management controller.
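The execution path (scheduler, information processing object, logging object, flash) might look roughly like the following sketch. The class names, the log entry format, and the use of a Python list as a stand-in for BMC flash are all assumptions made for illustration.

```python
class InfoProcessor:
    """Hypothetical information processing object."""

    def process(self, task: dict) -> dict:
        # Normalize the raw task fields into the information to be logged.
        return {"slot": task["slot"], "address": hex(task["address"]), "kind": "CE"}


class FaultLogger:
    """Hypothetical logging object; `flash` is a list standing in for BMC flash."""

    def __init__(self, flash: list):
        self.flash = flash

    def write(self, info: dict) -> str:
        entry = f"{info['kind']} slot={info['slot']} addr={info['address']}"
        self.flash.append(entry)  # stand-in for the flash write
        return entry


def run_scheduler(tasks_in_order, processor: InfoProcessor, logger: FaultLogger) -> list:
    """Process each task in its execution order and persist one log entry per task."""
    return [logger.write(processor.process(task)) for task in tasks_in_order]
```

Here `tasks_in_order` is assumed to already reflect the priority-based execution order determined when the tasks were queued.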
[0029] In some implementations, after performing the target task, the method further includes:
[0030] Obtain the execution results of the target task;
[0031] Obtain log fragments based on the execution results and write these fragments to the fault information log; and
[0032] In response to a determination that the execution of the target task has failed based on the execution result, an error response is generated and returned to the target object.
[0033] In this embodiment, the execution results are converted into log fragments and written to the fault information log. The execution status of each target task can be determined through the fault information log.
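As a rough sketch of the result handling described above, the function below turns an execution result into a log fragment and returns an error response only on failure. The fragment format and the shape of the error response are hypothetical.

```python
from typing import Optional


def handle_result(task_id: int, ok: bool, detail: str, fault_log: list) -> Optional[dict]:
    """Append a log fragment for the task; return an error response if the task failed."""
    status = "ok" if ok else "failed"
    # The fragment records the execution status so it can be read back from the log.
    fault_log.append(f"task={task_id} status={status} detail={detail}")
    if not ok:
        # Hypothetical error response returned to the target object.
        return {"task_id": task_id, "error": detail}
    return None
```

Because every task leaves a fragment, the fault information log alone is enough to tell which tasks succeeded and which failed.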
[0034] In some implementations, the method further includes:
[0035] Obtain process information for executing the target task; and
[0036] Based on the process information and fault information recording requests, generate a process log for executing the target task.
[0037] In this embodiment, a process log for executing the target task is generated based on the process information and fault information recording request for executing the target task, and the entire AHCEE processing process is recorded using the process log.
[0038] In some implementations, after performing the target task, the method further includes:
[0039] Determine the occupied resources corresponding to the target task in the baseboard management controller; and
[0040] Release occupied resources.
[0041] In some implementations, the method further includes:
[0042] During the execution of the target task, it is determined whether a preset error occurred in the baseboard management controller; and
[0043] In response to the determination that a preset error has occurred in the baseboard management controller, an error message is generated and output.
[0044] In some implementations, creating the target task corresponding to the fault information recording request includes:
[0045] Verify that the fault information logging request contains the necessary data and parameters; and
[0046] In response to the determination that the fault information recording request contains the necessary data and parameters, an asynchronous task is created inside the baseboard management controller, and the asynchronous task is used as the target task corresponding to the fault information recording request.
[0047] In some implementations, the preset mode is system management mode.
[0048] In some implementations, performing data integrity verification on fault information logging requests includes:
[0049] Obtain a reference check value, which is calculated by the basic input / output system using a data integrity verification algorithm based on the fault information recording request; and
[0050] The baseboard management controller uses a data integrity verification algorithm to calculate the current check value of the fault information recording request, and compares the current check value with the reference check value to perform data integrity verification on the fault information recording request.
[0051] In a second aspect, this application provides a method for recording server fault information, which is applied to a basic input / output system and includes:
[0052] Obtain server fault information and generate a preset number of fault information record requests based on the fault information;
[0053] A preset number of fault information recording requests are sent to the baseboard management controller. These requests instruct the baseboard management controller to create a target task. The baseboard management controller then sends the processing result corresponding to the fault information recording request to the basic input / output system and executes the target task, which records the fault information corresponding to the fault information recording request.
[0054] Upon receiving confirmation of the processing result, the central processing unit is controlled to exit the preset mode and continue processing business data.
[0055] The server fault information recording method provided in this embodiment involves the Basic Input / Output System (BIOS) generating a preset number of fault information recording requests based on the fault information, and sending these requests to the Baseboard Management Controller (BMC), instructing the BMC to record the fault information. Upon receiving the fault information recording requests, the BMC first creates a target task and then returns the processing result to the BIOS, enabling the Central Processing Unit (CPU) to exit the preset mode as quickly as possible.
[0056] In some implementations, obtaining server fault information and generating a preset number of fault information record requests based on the fault information includes:
[0057] Monitor the server to determine if there are any correctable errors on the server.
[0058] In response to the determination that a correctable error exists in the server, obtain the fault information of the correctable error;
[0059] Determine whether the number of fault messages is greater than or equal to a preset threshold; and
[0060] In response to the determination that the number of fault information is greater than or equal to a preset threshold, a preset number of fault information record requests are generated based on the fault information.
[0061] In some implementations, creating the target task corresponding to the fault information recording request includes:
[0062] Verify that the fault information logging request contains the necessary data and parameters; and
[0063] In response to the determination that the fault information recording request contains the necessary data and parameters, an asynchronous task is created inside the baseboard management controller, and the asynchronous task is used as the target task corresponding to the fault information recording request.
[0064] In some implementations, the preset mode is system management mode.
[0065] In this embodiment, the BIOS is modified to reduce the frequency of CE reporting for memory in the same slot and at the same address by setting a threshold.
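The threshold-based consolidation on the BIOS side can be sketched as follows. This is a simplified model, not BIOS code: the class name, the default threshold of 3, and the choice to emit a single aggregated request once the threshold is reached are all assumptions.

```python
from collections import defaultdict


class CeAggregator:
    """Illustrative per-(slot, address) CE counter with a reporting threshold."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # hypothetical preset threshold
        self._pending = defaultdict(list)

    def report(self, slot: int, address: int, detail: str) -> list:
        """Record one CE; return recording requests only when the threshold is reached."""
        key = (slot, address)
        self._pending[key].append(detail)
        if len(self._pending[key]) < self.threshold:
            return []  # below the threshold: nothing is sent to the BMC
        faults = self._pending.pop(key)  # reset the counter for this slot and address
        return [{"slot": slot, "address": address, "count": len(faults), "details": faults}]
```

With a threshold of 3, the first two errors at the same slot and address produce no request, and the third produces one request that carries all three.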
[0066] In a third aspect, this application provides a server fault information recording device, which is deployed on a baseboard management controller and includes:
[0067] The task creation module is used to, in response to receiving a preset number of fault information recording requests from the basic input / output system, create target tasks corresponding to the fault information recording requests, wherein the target tasks are used to record the fault information corresponding to the fault information recording requests; and
[0068] The task execution module is used to send the processing result corresponding to the fault information recording request to the basic input / output system, and execute the target task to complete the recording of fault information. The processing result is used to instruct the basic input / output system to control the central processing unit to exit the preset mode and continue processing business data.
[0069] In a fourth aspect, this application provides a server fault information recording device, which is deployed in a basic input / output system and includes:
[0070] The information acquisition module is used to acquire server fault information and generate a preset number of fault information record requests based on the fault information.
[0071] The information sending module is used to send a preset number of fault information recording requests to the baseboard management controller. Each fault information recording request instructs the baseboard management controller to create a target task. The baseboard management controller then sends the processing result corresponding to the fault information recording request to the basic input / output system and executes the target task. The target task records the fault information corresponding to the fault information recording request.
[0072] The control module is used to respond to the confirmation that a processing result has been received, and to control the central processing unit to exit the preset mode and continue processing business data.
[0073] In a fifth aspect, this application provides a computer device, comprising:
[0074] One or more processors; and
[0075] A memory associated with one or more processors, the memory being used to store computer-readable instructions, which, when read and executed by one or more processors, implement the server fault information recording method as described in the first aspect above or any corresponding embodiment thereof, or implement the server fault information recording method as described in the second aspect above or any corresponding embodiment thereof.
[0076] In a sixth aspect, this application provides a non-volatile computer-readable storage medium storing computer-readable instructions. When executed by one or more processors, the computer-readable instructions implement the server fault information recording method of the first aspect or any corresponding embodiment described above, or implement the server fault information recording method of the second aspect or any corresponding embodiment described above.
[0077] In a seventh aspect, this application provides a computer program product, including computer-readable instructions, which, when executed by one or more processors, implement the server fault information recording method of the first aspect or any corresponding embodiment described above, or implement the server fault information recording method of the second aspect or any corresponding embodiment described above.
Attached Figure Description
[0078] To more clearly illustrate the technical solutions in the embodiments or related technologies of this application, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0079] Figure 1 is a schematic diagram of the current CE reporting process according to an embodiment of this application;
[0080] Figure 2 is a flowchart illustrating a server fault information recording method applied to a baseboard management controller according to an embodiment of this application;
[0081] Figure 3 is a flowchart illustrating a server fault information recording method applied to a basic input / output system according to an embodiment of this application;
[0082] Figure 4 is a flowchart of an optimization method for reducing network jitter and latency of a server caused by a CE storm, according to an embodiment of this application.
[0083] Figure 5 is a structural block diagram of a server fault information recording device deployed on a baseboard management controller according to an embodiment of this application;
[0084] Figure 6 is a structural block diagram of a server fault information recording device deployed in a basic input / output system according to an embodiment of this application;
[0085] Figure 7 is a schematic diagram of the hardware structure of a computer device according to an embodiment of this application;
[0086] Figure 8 is a schematic diagram of the structure of a non-volatile computer-readable storage medium according to an embodiment of this application.
Detailed Implementation
[0087] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0088] Memory is a critical component that directly caches data for the central processing unit (CPU). If memory fails, the processor will stop responding, causing the system to crash. Memory capacity has gradually increased, from hundreds of megabytes (MB) to hundreds of gigabytes (GB). To achieve this increased capacity, memory storage density has increased and memory voltage has decreased, while memory clock speeds have increased. For a semiconductor device, denser storage cells, lower voltage, and higher clock speed trade reliability for performance, leading to a higher failure rate. The same applies to PCIe (Peripheral Component Interconnect Express) devices. PCIe transfer rates have continuously increased, from 5 GT/s (gigatransfers per second) to 8 GT/s and then to 16 GT/s. Due to the increased transmission speeds between devices and the increasingly complex topologies of the links, errors are more likely to occur during data transmission and reception in PCIe devices.
[0089] Therefore, it is necessary to address memory and PCIe device failures promptly. Errors triggered by memory and PCIe devices can be categorized into correctable errors (CE) and uncorrectable errors (UCE). Uncorrectable errors generally lead to server downtime, requiring on-site maintenance personnel to replace the components. Correctable errors, on the other hand, can be self-repaired by the system, allowing normal system operation after repair. However, frequent CEs can severely impact server system performance. This is because after a correctable error is repaired by hardware, the server system needs to record hardware error information, such as the BMC recording SEL (System Event Log) entries and black-box logs to the BMC flash memory. This process of recording fault information consumes CPU resources. Each CE record requires the CPU to enter SMM to process the fault data sent by the BIOS to the BMC, and to wait for the BMC to process the log record and return the processing result before exiting SMM to continue processing normal business data. If the hardware reports a large number of CEs, the CPU needs to enter SMM frequently to process these fault information reports, leading to a decrease in server performance, manifested in network jitter and latency, which is detrimental to business operations. Note that SMM (System Management Mode) is a standardized architectural feature. When the external SMM interrupt pin (SMI#) is activated or an SMI (System Management Interrupt) is received from the APIC (Advanced Programmable Interrupt Controller), the processor enters SMM. In SMM, the processor saves the entire context of the currently running program and switches to a separate address space, where SMM-specific code is executed transparently. Upon returning from SMM, the processor returns to the state it was in before the system management interrupt.
The BIOS is a program embedded in a ROM (Read-Only Memory) on the motherboard inside the computer. It stores the computer's most important basic input / output programs, power-on self-test programs, and system startup programs. It can read and write system settings information from CMOS (Complementary Metal Oxide Semiconductor). Its main function is to provide the computer with the lowest-level and most direct hardware settings and control. The BMC uses the built-in FW (Firmware) and sensors distributed on the baseboard, system board, and chassis to perform management functions such as data collection, event logging, error diagnosis, and troubleshooting.
[0090] The current CE reporting process is shown in Figure 1. During system operation, the BIOS, in conjunction with hardware monitoring, detects correctable errors reported by memory or PCIe devices. Upon confirming the detection of a CE, the server system triggers an SMI, causing the CPU to enter SMM to handle the correctable error reporting process. The entire process is as follows: the BIOS processes the fault information and sends a notification to the BMC via KCS (Keyboard Controller Style); the BMC receives the fault information from the BIOS through the KCS interface; the BMC generates a SEL record and triggers the information processing program; the BMC calls the log writing program to write the relevant error logs to the BMC flash; after the flash write completes, a successful write result is returned to the information processing program, which then returns to the KCS processing program. At this point, the BIOS recognizes that the BMC log processing and writing are complete and informs the CPU that the SMI handling is complete, and the CPU exits SMM. This CE reporting process is lengthy, comprising BIOS-side processing and BMC-side processing. The BMC-side processing matters most, as it is affected by the BMC's processing speed, the current task scheduling, and the flash write time. In extreme cases, the overall processing time can exceed 1 second (1000 ms). If SMIs occur frequently, the delay and jitter become very noticeable, resulting in a poor customer experience.
[0091] Based on the above, this application provides a server fault information recording method, which reduces the impact of correctable errors from memory and PCIe devices on system performance through two strategies. First, it reduces the frequency of CE reports; second, it reduces the processing time for CE reports. Reducing the CE report frequency can be achieved through BIOS modifications, setting thresholds to consolidate CE reports for memory in the same slot and at the same address. Reducing the CE report time is achieved by optimizing the overall SMI (System Management Interrupt) processing time of CE reporting, which shortens the time the CPU spends in SMM (System Management Mode). The BMC KCS (Keyboard Controller Style) asynchronous processing strategy, AHCEE (Asynchronous Handle CE Error), is employed to effectively reduce the overall SMI processing time and minimize the impact of correctable errors from components on service performance. This ultimately reduces the processing time of correctable errors from memory components and PCIe devices, thus mitigating the impact of frequent correctable error reports on server performance.
[0092] According to an embodiment of this application, a server fault information recording method embodiment is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in the baseboard management controller. Although a logical order is shown in the flowchart, in some cases, the steps shown or described can be executed in a different order than that shown here.
[0093] This embodiment provides a server fault information recording method, which can be used in a baseboard management controller. Figure 2 is a flowchart of the server fault information recording method according to an embodiment of this application. As shown in Figure 2, the process includes the following steps:
[0094] Step S201: In response to determining that a preset number of fault information recording requests have been received from the basic input / output system, a target task corresponding to the fault information recording request is created, wherein the target task is used to record the fault information corresponding to the fault information recording request.
[0095] This embodiment adopts the BMC KCS asynchronous processing strategy AHCEE (Asynchronous Handle CE Error), which effectively reduces the overall processing time of the SMI and reduces the impact of correctable errors of components on service performance.
[0096] The Basic Input / Output System (BIOS) monitors memory and PCIe devices for CEs via the memory controller. The BIOS is modified to reduce the frequency of CE reporting: a threshold is set on the number of errors from the same memory slot and address so that CE reports are consolidated. With this modification, after the BIOS detects CEs, it generates a preset number of fault information recording requests containing the CE fault information once the number of CEs reaches the threshold. The preset number is, for example, 5, 6, 7… The number can be set according to actual needs.
[0097] The BIOS sends a preset number of fault information recording requests to the BMC via the KCS. The BMC receives the preset number of fault information recording requests from the BIOS via the KCS interface. The KCS module is a submodule of the BMC LPC (Low Pin Count) controller and is used for communication between the BIOS and the BMC.
[0098] After receiving a preset number of fault information recording requests, the BMC initiates the AHCEE processing flow to verify whether the fault information recording requests contain the necessary data and parameters, such as data indicating the operations to be performed. If these data and parameters are included, the AHCEE processing flow creates an asynchronous task within the BMC, using this asynchronous task as the target task corresponding to the fault information recording request. The target task is then used to record the fault information corresponding to the fault information recording request. The target task is added to the task queue, awaiting processing.
[0099] Step S202: Send the processing result corresponding to the fault information recording request to the basic input / output system, and execute the target task to complete the recording of fault information. The processing result is used to instruct the basic input / output system to control the central processing unit to exit the preset mode and continue processing business data.
[0100] In some embodiments, after the BMC creates the target task, it returns the SMI processing result to the BIOS, i.e., sends the processing result corresponding to the fault information recording request to the BIOS. Upon receiving the processing result, the BIOS controls the CPU to exit a preset mode, such as SMM mode. After exiting the preset mode, the CPU continues processing business data. Simultaneously with the BMC returning the processing result to the BIOS, it invokes the asynchronous fault log information processing flow to execute the target task and record fault log information.
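The acknowledge-first, record-later behavior of steps S201 and S202 can be modeled with a minimal sketch. It uses a Python worker thread and an in-memory list purely for illustration; the class name `AsyncCeHandler`, the `"ACK"` return value, and the log format are assumptions, and real BMC firmware would use its own task and flash mechanisms.

```python
import queue
import threading


class AsyncCeHandler:
    """Illustrative model of acknowledging a request before the slow log write runs."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.flash_log = []  # stand-in for BMC flash
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_request(self, request: dict) -> str:
        """Request path: only enqueue the target task, then return immediately."""
        self.tasks.put(request)
        return "ACK"  # stands in for the processing result returned to the BIOS

    def _run(self) -> None:
        # Background path: performs the slow recording work off the request path.
        while True:
            request = self.tasks.get()
            try:
                self.flash_log.append(f"CE slot={request['slot']}")
            finally:
                self.tasks.task_done()

    def wait_idle(self) -> None:
        """Block until every queued task has been recorded (useful in tests)."""
        self.tasks.join()
```

The point of the design is that the time spent in `on_request` does not depend on how long the log write takes, which is what allows the CPU to leave the preset mode early.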
[0101] It should be noted that the average processing time of the BIOS-side code in the fault reporting process in this embodiment is about 10 ms, while the BMC processing time is affected by the BMC's processing speed, the current task scheduling, and the flash write time. In extreme cases, the overall processing time can exceed 1000 ms. This solution adopts an asynchronous processing strategy to isolate the longer BMC portion of the processing. In actual testing, the overall average time for correctable error reporting has been reduced to about 10 ms. That is, after optimization, the time the CPU spends in SMM has been reduced to about 10 ms, and CE fault handling performance has improved by about 100 times. When customers deliberately trigger CE storms in a real environment, various performance tests show no delay or jitter in the customer's business and applications.
[0102] In addition, this application has been fully deployed on cloud servers in CDC (Cloud Dedicated Cluster) data centers and is playing a good role, greatly reducing the impact of correctable component errors on server performance.
[0103] The server fault information recording method provided in this embodiment involves the baseboard management controller (BMC) first creating a target task after receiving a fault information recording request, and then returning the processing result to the basic input / output system (BIOS) to allow the central processing unit (CPU) to exit the preset mode as quickly as possible. Simultaneously, the target task is executed asynchronously to complete the fault information recording. This overall optimization reduces the time the CPU remains in the preset mode during fault information recording, minimizing the impact of correctable component errors on service performance. It also solves the problem that recording CE fault information leads to server performance degradation, increased network jitter, and response latency.
[0104] In some implementations, creating a target task corresponding to a fault information recording request includes: performing data integrity verification on the fault information recording request; and creating the target task in response to determining that the fault information recording request has passed verification.
[0105] In some embodiments, data integrity verification is performed on fault information recording requests. For example, when generating a fault information recording request, the BIOS calculates and saves a reference check value corresponding to the fault information recording request using a data integrity verification algorithm. When the BMC verifies the fault information recording request, it calculates the current check value of the fault information recording request using the same data integrity verification algorithm, compares the reference check value with the current check value, and determines whether the fault information recording request is complete. Data integrity verification algorithms include, for example, checksum algorithms, cyclic redundancy check (CRC) algorithms, and hash functions. A checksum is a simple numerical value generated by performing some operation (such as summation, XOR, etc.) on the data. The receiver can recalculate the checksum using the same algorithm and compare it with the checksum provided by the sender to verify data integrity. CRC is a more powerful verification method that generates a checksum based on polynomial division. A hash function maps data of arbitrary length to a fixed-length hash value (also called a digest). The receiver can recalculate the hash value using the same hash function and compare it with the hash value provided by the sender to verify data integrity.
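The three families of integrity checks named above can be sketched as follows. This is an illustration under assumptions: the payload layout is invented, and the text does not say which algorithm the BIOS and BMC actually use. `zlib.crc32` and `hashlib.sha256` are standard-library functions; `xor_checksum` and `verify` are hypothetical helpers.

```python
import hashlib
import zlib

def xor_checksum(data: bytes) -> int:
    """Simple checksum: XOR of all bytes."""
    value = 0
    for byte in data:
        value ^= byte
    return value

def crc32(data: bytes) -> int:
    """CRC based on polynomial division (CRC-32 from the standard library)."""
    return zlib.crc32(data) & 0xFFFFFFFF

def sha256_digest(data: bytes) -> str:
    """Hash function: fixed-length digest of arbitrary-length data."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, reference, algorithm) -> bool:
    """Receiver side: recompute the check value and compare with the reference."""
    return algorithm(data) == reference

payload = b"CE slot=DIMM0 addr=0x1000"    # hypothetical request body
reference = crc32(payload)                # computed by the sender (BIOS side)
intact = verify(payload, reference, crc32)
corrupted = verify(payload[:-1] + b"1", reference, crc32)
```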
[0106] In this embodiment, by performing data integrity verification on the fault information recording request, it can be verified whether the format of the fault information recording request is correct, whether the parameters are valid, and whether it contains the necessary permission information.
[0107] In some implementations, data integrity verification of the fault information recording request includes: obtaining the structure data corresponding to the fault information recording request; determining whether the format of the structure data is the same as a preset format, wherein the preset format is used to verify the permission information of the fault information recording request; in response to determining that the format of the structure data is the same as the preset format, obtaining the parameter type and parameter quantity of the structure data; in response to determining that the parameter type is the same as the preset type and the parameter quantity is the same as the preset value, the fault information recording request verification passes.
[0108] In some embodiments, the BIOS sends the structure data corresponding to the fault information logging request to the BMC via the KCS channel. The BMC needs to verify the data integrity and parameter count integrity of the structure data.
[0109] A preset format is used to verify the permission information of the fault information recording request. The preset format includes the previously defined structure content and format. The structure data is read, and it is determined whether the structure data format is the same as the preset format. If the format of the structure data is confirmed to be the same as the preset format, the fault information recording request is deemed to have a correct format, valid parameters, and contains the necessary permission information. Otherwise, the fault information recording request does not meet the requirements of a correct format, valid parameters, and necessary permission information.
[0110] The data integrity and parameter count integrity of the fault information recording request are then verified further. The parameter type and parameter quantity of the structure data are obtained. In response to determining that the parameter type is the same as the preset type and the parameter quantity is the same as the preset value, it is determined that the fault information recording request has data integrity and parameter count integrity, and the fault information recording request passes verification.
[0111] In this embodiment, the fault information recording request is checked for data integrity using a preset format, preset type, and preset value to ensure that the fault information recording request is in the correct format, has valid parameters, contains necessary permission information, and has data integrity and parameter number integrity.
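The two-stage check described above (format first, then parameter type and quantity) might look like the sketch below. The request layout is an assumption made for illustration: `PRESET_KEYS`, `PRESET_HEADER`, and `PRESET_TYPES` are hypothetical, and a Python dict stands in for the structure data received over the KCS channel.

```python
PRESET_KEYS = {"header", "params"}     # hypothetical preset format
PRESET_HEADER = "CE_RECORD"            # hypothetical marker carried by valid requests
PRESET_TYPES = (int, int, int)         # assumed parameters: slot, error type, address

def verify_request(request: dict) -> bool:
    # Stage 1: the structure must match the preset format.
    if set(request) != PRESET_KEYS or request["header"] != PRESET_HEADER:
        return False
    # Stage 2: parameter quantity, then parameter types.
    params = request["params"]
    if len(params) != len(PRESET_TYPES):
        return False
    return all(type(p) is t for p, t in zip(params, PRESET_TYPES))

good = {"header": "CE_RECORD", "params": [3, 1, 0x1000]}
wrong_count = {"header": "CE_RECORD", "params": [3, 1]}
wrong_type = {"header": "CE_RECORD", "params": ["3", 1, 0x1000]}
wrong_format = {"hdr": "CE_RECORD", "params": [3, 1, 0x1000]}
```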
[0112] In some implementations, creating a target task corresponding to a fault information recording request includes: creating an asynchronous task within the baseboard management controller based on the fault information recording request, and using the asynchronous task as the target task; adding the target task to a task queue and obtaining the priority of the target task in the task queue; and determining the execution order of the target tasks in the task queue based on the priority.
[0113] In some embodiments, based on a fault information logging request, an asynchronous task is created within the BMC, and this asynchronous task is used as the target task. The target task is then added to the task queue, awaiting processing.
[0114] The priority of the target task in the task queue is then obtained. For example, the priority is determined based on factors such as resource availability, where resource availability is calculated by assessing the current CPU core processing load during the integrity verification of the fault information recording request. The execution order of the target tasks in the task queue is determined based on their priorities.
[0115] In this embodiment, an asynchronous task is created within the baseboard management controller based on the fault information recording request. An asynchronous processing strategy is adopted to isolate the part of the process in which the BMC records the error information, which shortens the time the central processing unit is in the preset mode and reduces the impact of correctable component errors on service performance.
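A priority-ordered task queue of the kind described above can be sketched with a binary heap. This is an assumption-laden illustration: the `TaskQueue` class, the convention that a lower number runs first, and the example priorities are all hypothetical, since the text does not specify how priorities are encoded.

```python
import heapq
import itertools

class TaskQueue:
    """Priority queue of target tasks; lower priority number runs first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps insertion order

    def add(self, task, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        _, _, task = heapq.heappop(self._heap)
        return task

    def __len__(self) -> int:
        return len(self._heap)

q = TaskQueue()
q.add("pcie-ce", priority=2)
q.add("memory-ce", priority=1)
q.add("memory-ce-2", priority=1)
order = [q.pop() for _ in range(len(q))]
```

Tasks with equal priority are executed in the order they were added, because the counter breaks ties.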
[0116] In some implementations, executing the target task includes: determining the task to be executed in the task queue according to the task scheduler and the execution order; calling an information processing object to process the task to be executed, obtaining processed information, and calling a log recording object to generate a fault information log based on the processed information; and saving the fault information log to the flash memory of the baseboard management controller.
[0117] In some embodiments, AHCEE performs asynchronous operations by calling corresponding business logic code or external services to execute the actual asynchronous operations. The BMC task scheduler retrieves target tasks from the task queue in sequence according to the execution order of the target tasks in the task queue and treats the target tasks as tasks to be executed.
[0118] The information processing object is, for example, the AHCEE information processing module; the log recording object is, for example, the AHCEE log recording processing module.
[0119] The system calls the information processing object to process the task to be executed, obtains the processed information, and calls the log recording object to generate a fault information log based on the processed information; the fault information log is then saved to the flash memory of the baseboard management controller.
[0120] In this embodiment, an asynchronous processing strategy is adopted to isolate the process in which the BMC records the error information, thereby shortening the time the central processing unit is in the preset mode and reducing the impact of correctable component errors on service performance.
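The execution path described above (scheduler, information processing object, log recording object, save to flash) can be sketched as a small pipeline. The functions `process_info`, `make_log`, and `run_scheduler` and the log line format are hypothetical, and a list stands in for the BMC flash.

```python
from collections import deque

def process_info(task: dict) -> dict:
    """Hypothetical information processing object: normalize the raw request."""
    return {"component": task["component"], "address": hex(task["address"])}

def make_log(info: dict) -> str:
    """Hypothetical log recording object: render one fault log line."""
    return f"CE component={info['component']} address={info['address']}"

def run_scheduler(tasks: deque, flash: list) -> None:
    """Take tasks in execution order, process them, and save the logs."""
    while tasks:
        task = tasks.popleft()
        flash.append(make_log(process_info(task)))

pending = deque([
    {"component": "DIMM0", "address": 0x1000},
    {"component": "PCIe-RP1", "address": 0x20},
])
flash = []
run_scheduler(pending, flash)
```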
[0121] In some implementations, after executing the target task, the method further includes: obtaining the execution result of the target task; obtaining a log fragment based on the execution result and writing the log fragment to a fault information log; and generating an error response in response to determining that the execution of the target task has failed based on the execution result and returning the error response to the target object.
[0122] In some embodiments, the results of asynchronous operations are collected, which are the results of the execution of the target task.
[0123] The execution results are processed or transformed as necessary to obtain log fragments. For example, to record the execution results into the BMC flash, some log format conversion processing is required to obtain log fragments. These log fragments facilitate reporting to the SEL or black box. The log fragments are then written to the fault information log stored in the BMC flash.
[0124] In response to a determination that the target task has failed based on the execution result, the error information is logged and a corresponding error response is prepared. This error response is then returned to the target object, such as the AHCEE caller. Additionally, the processing result is returned to the AHCEE caller via callback, parsing, or event triggering, because calls may involve PCIe and memory CE, requiring the main program to check the call results and parameters.
[0125] In this embodiment, the execution results are converted into log fragments and written to the fault information log. The execution status of each target task can be determined through the fault information log, which facilitates subsequent troubleshooting and performance optimization.
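A minimal sketch of the result handling described above: the execution result is converted into a log fragment, appended to the fault information log, and an error response is produced only on failure. The result fields (`task_id`, `success`, `detail`) and both function names are assumptions made for illustration.

```python
def to_log_fragment(result: dict) -> str:
    """Convert an execution result into a log line suitable for the fault log."""
    status = "OK" if result["success"] else "FAIL"
    return f"task={result['task_id']} status={status} detail={result['detail']}"

def finish_task(result: dict, fault_log: list):
    """Write the log fragment; return an error response if the task failed."""
    fault_log.append(to_log_fragment(result))
    if not result["success"]:
        return {"task_id": result["task_id"], "error": result["detail"]}
    return None

fault_log = []
ok_response = finish_task(
    {"task_id": 1, "success": True, "detail": "written"}, fault_log)
err_response = finish_task(
    {"task_id": 2, "success": False, "detail": "flash write failure"}, fault_log)
```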
[0126] In some implementations, the method further includes: acquiring process information for executing the target task; and generating a process log for executing the target task based on the process information and fault information recording requests.
[0127] In some embodiments, process information for executing the target task is obtained, including: processing steps, results, and errors of the entire AHCEE process.
[0128] Based on the process information and fault information recording requests, a process log for executing the target task is generated. The process log records the entire AHCEE processing process, including: fault information recording requests, processing steps, results, and errors.
[0129] In this embodiment, a process log for executing the target task is generated based on the process information and fault information recording request for executing the target task. The process log is used to record the entire AHCEE processing process, which facilitates subsequent fault diagnosis and performance optimization.
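As an illustration of the process log described above, the sketch below combines the original request with a list of processing steps, results, and errors. The tuple layout `(step, result, error)` and the function name `build_process_log` are hypothetical.

```python
def build_process_log(request: dict, steps: list) -> list:
    """Record the request followed by one line per processing step."""
    lines = [f"request: {request}"]
    for step, result, error in steps:
        line = f"step={step} result={result}"
        if error:
            line += f" error={error}"
        lines.append(line)
    return lines

process_log = build_process_log(
    {"component": "DIMM0"},
    [("verify", "passed", None), ("write_flash", "failed", "timeout")],
)
```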
[0130] In some implementations, after executing the target task, the method further includes: determining the occupied resources corresponding to the target task in the baseboard management controller; and releasing the occupied resources.
[0131] In some embodiments, the occupied resources corresponding to the target task in the baseboard management controller are determined, for example database connections and file handles. (Some models need to record more information, and the BMC has an external interface for connecting to a database.) The occupied resources are then released.
[0132] In this embodiment, occupied resources are released to achieve resource cleanup and recycling, improve resource utilization, and ensure the correct release of resources to avoid resource leakage and performance problems.
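Resource cleanup of this kind is commonly written so that release happens even when the task fails. The sketch below assumes a hypothetical `Resource` class standing in for a file handle or database connection; it illustrates the idea rather than the BMC firmware's actual mechanism.

```python
class Resource:
    """Hypothetical stand-in for a file handle or database connection."""

    def __init__(self, name: str):
        self.name = name
        self.is_open = True

    def close(self) -> None:
        self.is_open = False

def execute_and_release(task, resources):
    """Run the task, then release every resource it held, even on failure."""
    try:
        return task()
    finally:
        for resource in resources:
            resource.close()

def failing_task():
    raise RuntimeError("flash write failed")

file_handle = Resource("log-file")
ok = execute_and_release(lambda: "done", [file_handle])

db_conn = Resource("db-connection")
failed = False
try:
    execute_and_release(failing_task, [db_conn])
except RuntimeError:
    failed = True
```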
[0133] In some implementations, the method further includes: determining whether a preset error occurs in the baseboard management controller during the execution of the target task; and generating and outputting error information in response to determining that a preset error has occurred in the baseboard management controller.
[0134] In some embodiments, preset errors include, for example, Flash write failure, timeout, and interface call failure. During the execution of the target task, i.e., the AHCEE process, it is determined whether a preset error has occurred in the BMC. In response to determining that a preset error has occurred in the BMC, a corresponding alarm mechanism is triggered, and error information is generated and output to notify relevant personnel for handling. The error information can be output by, for example, recording it in the BMC log or printing it via the serial port.
[0135] In this embodiment, it is determined whether a preset error has occurred in the baseboard management controller. In response to determining that a preset error has occurred, an error message is generated and output to notify relevant personnel for processing, so as to avoid failure to record error information due to a malfunction of the baseboard management controller.
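The preset-error check can be sketched by mapping failures raised during task execution onto a small set of error kinds and recording a message. Mapping `OSError` to a Flash write failure and `TimeoutError` to a timeout is an assumption made for this illustration; `PresetError` and `run_task` are hypothetical names.

```python
import enum

class PresetError(enum.Enum):
    FLASH_WRITE_FAILURE = "flash write failure"
    TIMEOUT = "timeout"

def run_task(task, bmc_log: list):
    """Run a task; on a preset error, record and return an error message."""
    try:
        task()
    except TimeoutError:      # checked first: TimeoutError is a subclass of OSError
        error = PresetError.TIMEOUT
    except OSError:
        error = PresetError.FLASH_WRITE_FAILURE
    else:
        return None
    message = f"AHCEE error: {error.value}"
    bmc_log.append(message)   # could equally be printed via the serial port
    return message

def flash_failure():
    raise OSError("write failed")

def timed_out():
    raise TimeoutError("task exceeded its deadline")

bmc_log = []
first = run_task(flash_failure, bmc_log)
second = run_task(timed_out, bmc_log)
third = run_task(lambda: None, bmc_log)
```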
[0136] According to an embodiment of this application, an embodiment of a server fault information recording method is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a basic input / output system. Furthermore, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.
[0137] This embodiment provides a server fault information recording method, which can be used in a basic input / output system. Figure 3 is a flowchart of the server fault information recording method according to an embodiment of this application. As shown in Figure 3, the process includes the following steps:
[0138] Step S301: Obtain server fault information and generate a preset number of fault information record requests based on the fault information.
[0139] In some embodiments, the Basic Input / Output System (BIOS) monitors memory and PCIe devices for correctable errors (CE) via the memory controller. The BIOS is modified to reduce the frequency of CE reporting: a threshold is set for the number of memory errors in the same slot and at the same address, so that CE reporting converges. With this modification, after the BIOS detects CE errors, it generates a preset number of fault information record requests once the number of CE errors reaches the threshold. The fault information record requests contain the CE fault information. The preset number is, for example, 5, 6, 7, and so on; it is set according to actual needs.
[0140] Step S302: A preset number of fault information recording requests are sent to the baseboard management controller. The fault information recording request is used to instruct the baseboard management controller to create a target task. The baseboard management controller is used to send the processing result corresponding to the fault information recording request to the basic input / output system and execute the target task. The target task is used to record the fault information corresponding to the fault information recording request.
[0141] In some embodiments, a preset number of fault information recording requests are sent to the baseboard management controller. Upon receiving the preset number of fault information recording requests, the BMC initiates the AHCEE processing flow to verify whether each fault information recording request contains the necessary data and parameters, such as data indicating the operation to be performed. In response to determining that the request contains the necessary data and parameters, the AHCEE processing flow creates an asynchronous task within the BMC and uses this asynchronous task as the target task corresponding to the fault information recording request. The target task is used to record the fault information corresponding to the fault information recording request.
[0142] After creating the target task, the BMC returns the SMI processing result to the BIOS, which means sending the processing result corresponding to the fault information recording request to the BIOS. Simultaneously with returning the processing result to the BIOS, the BMC invokes the asynchronous fault log information processing flow to execute the target task and record the fault log information.
[0143] Step S303: In response to confirming that the processing result has been received, control the central processing unit to exit the preset mode and continue processing business data.
[0144] In some embodiments, after receiving the processing result, the BIOS controls the CPU to exit a preset mode, such as SMM mode. After exiting the preset mode, the CPU continues to process business data.
[0145] The server fault information recording method provided in this embodiment involves the Basic Input / Output System (BIOS) generating a preset number of fault information recording requests based on the fault information, and sending these requests to the Baseboard Management Controller (BMC), instructing the BMC to record the fault information. Upon receiving the fault information recording requests, the BMC first creates a target task and then returns the processing result to the BIOS, enabling the Central Processing Unit (CPU) to exit the preset mode as quickly as possible. Simultaneously, the target task is executed asynchronously, achieving asynchronous processing of the correctable error reporting logs corresponding to the Baseboard Management Controller (BMC), shortening the time the CPU remains in the preset mode, and reducing the impact of correctable component errors on service performance. This solves the problem that recording CE fault information leads to server performance degradation, increased network jitter, and response latency.
[0146] In some implementations, obtaining server fault information and generating a preset number of fault information record requests based on the fault information includes: monitoring the server and determining whether there are correctable errors in the server; in response to determining that there are correctable errors in the server, obtaining fault information of correctable errors; determining whether the number of fault information is greater than or equal to a preset threshold; and in response to determining that the number of fault information is greater than or equal to the preset threshold, generating a preset number of fault information record requests based on the fault information.
[0147] In some embodiments, the server is monitored to determine whether there are correctable errors in the server. For example, the BIOS configures the memory controller to monitor CE errors in each memory rank and in each root port of the PCIe devices.
[0148] In response to the determination that a correctable error exists in the server, fault information for the correctable error is obtained. It is then determined whether the number of fault information entries is greater than or equal to a preset threshold. The preset threshold is set in advance, for example: 5, 6, 7… The number is set according to actual needs. In response to the determination that the number of fault information entries is greater than or equal to the preset threshold, a preset number of fault information record requests are generated based on the fault information.
[0149] In this embodiment, the BIOS is modified to reduce the frequency of CE reporting for memory errors in the same slot and at the same address by setting a threshold. This prevents the central processing unit from frequently entering the preset mode and reduces the impact on server performance caused by frequent reporting of correctable errors.
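The threshold-based convergence of CE reporting can be sketched as a per-location counter. Several details here are assumptions, not statements from the text: the class name `CEThrottle`, keying the counter by `(slot, address)`, resetting the counter after reporting, and the example values 5 and 2.

```python
from collections import Counter

class CEThrottle:
    """Report CEs for a (slot, address) pair only once a threshold is reached."""

    def __init__(self, threshold: int, preset_number: int):
        self.threshold = threshold
        self.preset_number = preset_number   # requests generated per report
        self.counts = Counter()

    def on_ce(self, slot: str, address: int) -> list:
        key = (slot, address)
        self.counts[key] += 1
        if self.counts[key] < self.threshold:
            return []                        # below threshold: nothing is reported
        self.counts[key] = 0                 # assumption: the counter restarts
        return [
            {"slot": slot, "address": address, "seq": i}
            for i in range(self.preset_number)
        ]

throttle = CEThrottle(threshold=5, preset_number=2)
early = [throttle.on_ce("DIMM0", 0x1000) for _ in range(4)]
requests = throttle.on_ce("DIMM0", 0x1000)
```

With these example values, the first four errors at the same location produce no request, and the fifth produces two.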
[0150] In some implementations, a method for uploading fault information of CE and UCE separately may include steps A1 to A4.
[0151] Step A1: The BIOS reads the preset registers to determine whether a UCE has occurred in the memory and PCIe devices; in response to determining that no UCE has occurred in the memory and PCIe devices, step A2 is executed; in response to determining that a UCE has occurred in the memory and PCIe devices, step A3 is executed.
[0152] In some embodiments, the BIOS needs to read a preset register to determine whether a UCE (uncorrectable error) has occurred. Patrol Scrub refers to the memory patrol and cleanup process; a UCE indicates a malfunction in the memory patrol and cleanup function. A UCE means that memory cannot be automatically cleared after a process closes, so memory usage keeps growing as the number of processes increases, severely impacting server performance. This step requires a preset register that indicates the UCE status. For example, bits 31-16 of the MSR_MC13_STATUS register can be used. If the BIOS reads the value 0x0010 in these bits, it indicates a memory degradation CE, that is, a UCE; otherwise, the error is a normal CE.
[0153] Additionally, since UCEs typically occur only once every few days, the BIOS can be configured to read the preset register every hour. Alternatively, a triggered read can be configured, meaning the preset register is read each time a CE error occurs in memory. Since it is impossible to know in advance whether the error is a UCE or a regular CE, each memory error can trigger the BIOS to read the preset register.
[0154] Step A2: Determine whether the normal CE count meets the error leakage threshold; in response to determining that the normal CE count meets the error leakage threshold, execute step A3.
[0155] In some embodiments, in response to determining that the BIOS did not detect a UCE by reading the preset register, it further determines whether the ordinary CE count meets the error leakage threshold. Typically, the ordinary CE count is incremented by one for each ordinary CE and compared against the error leakage threshold. Since ordinary CEs are correctable memory errors with relatively small impact, repairs can be performed uniformly after a certain number of ordinary CEs have accumulated. Even so, ordinary CEs should not accumulate excessively, to avoid affecting the normal operation of the server. Therefore, when the BIOS does not detect a UCE, it is still necessary to determine whether the ordinary CE count has reached the error leakage threshold.
[0156] Step A3 triggers an SMI interrupt.
[0157] In some embodiments, an SMI interrupt should be triggered in response to determining that a UCE has occurred or that the normal CE count meets the error leakage threshold, causing the CPU to enter SMM (System Management Mode). After the SMI interrupt is triggered, operations such as repairing normal CEs or downgrading UCEs can be performed. Triggering the SMI interrupt as soon as a UCE occurs prevents the BIOS from counting it as a normal CE, so the BIOS detects UCEs promptly. This enables server administrators to obtain memory fault information in a timely manner, facilitating the monitoring of server memory status.
[0158] Step A4: After triggering the SMI interrupt, send an error message command to the BMC; the error message command contains the reason for the SMI interrupt.
[0159] In some embodiments, an error message instruction is sent to the BMC to indicate that a UCE has occurred or that the normal CE count has reached the error leakage threshold, and that relevant repair work is required. The BMC can record this error message instruction and generate a corresponding system log, which helps those skilled in the art monitor the memory status and is beneficial for maintaining the server's memory.
[0160] Furthermore, the BMC determines the cause of the SMI interrupt based on the error message instruction, specifically whether it is a UCE or a regular CE. The corresponding handling methods for UCE and regular CE are different. In response to determining that the error message instruction corresponds to a UCE, the BIOS downgrades the UCE to a regular CE; in response to determining that the error message instruction corresponds to a regular CE, the BIOS terminates the process corresponding to the regular CE. Severe UCEs may require a power-off and restart, or require on-site repair by someone skilled in the art.
[0161] In this embodiment, the BMC can record fault information of CE and UCE, enabling server maintenance personnel to obtain memory fault information in a timely manner, which is beneficial for monitoring the server memory status. Furthermore, in this embodiment, the BMC can also repair CE and UCE separately, allowing the server system to function normally.
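The decision in steps A1 to A3 can be sketched as follows. The bit field (bits 31-16) and the value 0x0010 come from the example above; the function names, the use of `>=` for "meets the threshold", and the return values are assumptions made for illustration, and reading the actual MSR is outside the scope of the sketch.

```python
UCE_PATTERN = 0x0010   # value of bits 31-16 that marks a UCE in the example

def uce_flag(status_register: int) -> bool:
    """Step A1: extract bits 31-16 of the status register and compare."""
    return ((status_register >> 16) & 0xFFFF) == UCE_PATTERN

def smi_reason(status_register: int, ce_count: int, leak_threshold: int):
    """Return the reason for triggering an SMI, or None if no SMI is needed."""
    if uce_flag(status_register):
        return "UCE"                   # A1 -> A3: trigger the SMI immediately
    if ce_count >= leak_threshold:
        return "CE"                    # A2 -> A3: accumulated ordinary CEs
    return None                        # keep counting, no SMI yet

uce_case = smi_reason(0x0010 << 16, ce_count=0, leak_threshold=10)
ce_case = smi_reason(0x0, ce_count=10, leak_threshold=10)
quiet_case = smi_reason(0x0, ce_count=3, leak_threshold=10)
```

In step A4 the returned reason would be carried in the error message command sent to the BMC.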
[0162] In some embodiments, a server fault information recording method is provided, which can solve the same technical problem as steps S201-S202, or solve the same technical problem as steps S301-S303. Figure 4 is a flowchart of a server fault information recording method according to an embodiment of this application. As shown in Figure 4, the process includes the following steps:
[0163] The server powers on, starts the BIOS, and then starts the BMC. The BIOS process includes: configuring the memory controller to monitor correctable error (CE) reports for each memory rank and each PCIe root port; determining whether the threshold has been exceeded; if not, continuing to monitor CE reports for each memory rank and each PCIe root port; if so, setting the CE overflow value. The BIOS then enters the System Management Interrupt (SMI) fault handling function, sends the fault information to the BMC via IPMI commands, and clears the CE overflow value to zero. After the SMI processing result is received from the BMC, the SMI fault reporting function ends.
[0164] The BMC operation process includes: the BMC monitors the IPMI information sent by the BIOS and determines whether there is new IPMI processing information. If there is none, the BMC continues monitoring the IPMI information sent by the BIOS. If there is new IPMI processing information, the BMC KCS processing program is started and the flow proceeds through the following stages: BMC AHCEE processing begins; the BMC returns the SMI processing result; an AHCEE task is created; the AHCEE task enters task scheduling; the AHCEE information processing module runs; the AHCEE log recording module writes the log to Flash; AHCEE result feedback; AHCEE log recording; and AHCEE resource reclamation.
[0165] In some embodiments, the reference code for the BMC execution process in the above process is as follows:
[0166] In this implementation, by modifying the BIOS boot parameters, correctable errors at the same location and address are configured to be reported only after reaching a threshold. The interval between BIOS KCS (Keyboard Controller Style) message transmissions is reduced by modifying the BIOS parameters. Asynchronous processing of the BMC's correctable error reporting logs is achieved by modifying the BMC fault log information processing logic. This reduces the processing time for correctable errors from memory components and PCIe devices, minimizing the impact of frequent correctable error reporting on server performance.
[0167] This embodiment also provides a server fault information recording device, which is used to implement the above embodiments and implementation methods, and will not be repeated as already described. As used below, the term "module" can be a combination of software and / or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.
[0168] This embodiment provides a server fault information recording device, which is deployed on the baseboard management controller, as shown in Figure 5, and includes:
[0169] The task creation module 501 is used to create a target task corresponding to the fault information recording request in response to determining that a preset number of fault information recording requests have been received from the basic input / output system. The target task is used to record the fault information corresponding to the fault information recording request.
[0170] The task execution module 502 is used to send the processing result corresponding to the fault information recording request to the basic input / output system, and execute the target task to complete the recording of fault information. The processing result is used to instruct the basic input / output system to control the central processing unit to exit the preset mode and continue processing business data.
[0171] In some implementations, the task creation module 501 includes: a verification unit for performing data integrity verification on the fault information recording request; and a first creation unit for creating a target task in response to determining that the fault information recording request has passed verification.
[0172] In some implementations, the verification unit includes: a first acquisition submodule, used to acquire the structure data corresponding to the fault information recording request, and determine whether the format of the structure data is the same as the preset format, wherein the preset format is used to verify the permission information of the fault information recording request;
[0173] The second acquisition submodule is used to acquire the parameter type and parameter quantity of the structure data if the format of the structure data is the same as the preset format; the judgment submodule is used to determine that the fault information recording request passes verification if the parameter type is the same as the preset type and the parameter quantity is the same as the preset value.
[0174] In some implementations, the task creation module 501 includes: a second creation unit, configured to create an asynchronous task inside the baseboard management controller according to a fault information recording request, and to use the asynchronous task as the target task; a first acquisition unit, configured to add the target task to the task queue and acquire the priority of the target task in the task queue; and a first determination unit, configured to determine the execution order of the target tasks in the task queue according to the priority.
[0175] In some implementations, the task execution module 502 includes: a second determining unit, configured to determine the task to be executed in the task queue according to the task scheduler and the execution order; a calling unit, configured to call an information processing object to process the task to be executed, obtain the processed information, and call a log recording object to generate a fault information log based on the processed information; and a saving unit, configured to save the fault information log to the flash memory of the baseboard management controller.
[0176] In some embodiments, the device further includes: a first acquisition module for acquiring the execution result of the target task; a writing module for obtaining a log fragment based on the execution result and writing the log fragment into a fault information log; and a first generation module for generating an error response in response to determining that the execution of the target task has failed based on the execution result, and returning the error response to the target object.
[0177] In some embodiments, the device further includes: a second acquisition module for acquiring process information of executing the target task; and a second generation module for generating a process log of executing the target task based on the process information and fault information recording request.
[0178] In some embodiments, the device further includes: a determination module for determining the occupied resources corresponding to the target task in the baseboard management controller; and a release module for releasing the occupied resources.
[0179] In some embodiments, the device further includes: a judgment module for judging whether a preset error occurs in the baseboard management controller during the execution of the target task; and a third generation module for generating and outputting error information in response to determining that a preset error has occurred in the baseboard management controller.
[0180] Further functional descriptions of the above modules and units are the same as those in the corresponding embodiments described above, and will not be repeated here.
[0181] This embodiment provides a server fault information recording device, which is deployed in a basic input / output system, as shown in Figure 6. The device includes: an information acquisition module 601, used to acquire server fault information and generate a preset number of fault information recording requests based on the fault information; an information sending module 602, used to send the preset number of fault information recording requests to a baseboard management controller, wherein the fault information recording request instructs the baseboard management controller to create a target task, the baseboard management controller sends the processing result corresponding to the fault information recording request to the basic input / output system, and executes the target task, the target task being used to record the fault information corresponding to the fault information recording request; and a control module 603, used to, in response to confirming receipt of the processing result, control the central processing unit to exit a preset mode and continue processing business data.
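As a simplified, non-limiting sketch of the interaction described above, the following models the baseboard management controller as a worker thread that returns the processing result as soon as the target task is queued, so that the sender can leave the preset mode before the recording finishes. The class `BmcSketch`, the result string `"ACCEPTED"`, and the boolean that stands in for the preset mode of the central processing unit are all assumptions for illustration.

```python
import queue
import threading


class BmcSketch:
    """Hypothetical model of the controller side: acknowledge first, record later."""

    def __init__(self):
        self._tasks = queue.Queue()
        self.recorded = []
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, request):
        # Create the target task and return the processing result at once.
        self._tasks.put(request)
        return "ACCEPTED"

    def _run(self):
        while True:
            request = self._tasks.get()
            self.recorded.append(request)   # stand-in for the actual log write
            self._tasks.task_done()

    def wait_idle(self):
        # Block until every queued task has been recorded.
        self._tasks.join()


def bios_report(bmc, requests):
    in_preset_mode = True                   # stands in for the CPU's preset mode
    results = [bmc.submit(r) for r in requests]
    if all(r == "ACCEPTED" for r in results):
        in_preset_mode = False              # exit and resume business data
    return in_preset_mode
```

The point of the sketch is that `bios_report` returns without waiting for `recorded` to be filled; only a caller that needs the finished log would call `wait_idle`.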
[0182] In some implementations, the information acquisition module 601 includes: a first judgment unit for monitoring the server and determining whether a correctable error exists in the server; a second acquisition unit for acquiring fault information of the correctable error in response to determining that the correctable error exists in the server; and a generation unit for generating a preset number of fault information recording requests based on the fault information in response to determining that the number of pieces of fault information is greater than or equal to a preset threshold.
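For illustration only, the threshold check and request generation may be sketched as follows. How the fault information is distributed across the preset number of requests is not specified in this application; the round-robin split below, the function name `build_requests`, and the request fields are assumptions, and the sketch further assumes that the threshold is not smaller than the preset number.

```python
def build_requests(fault_records, threshold, preset_count):
    # Below the threshold nothing is sent, so the controller is not disturbed.
    if len(fault_records) < threshold:
        return []
    # Assumed policy: spread the records round-robin over preset_count requests.
    batches = [fault_records[i::preset_count] for i in range(preset_count)]
    return [{"request_id": i, "faults": batch} for i, batch in enumerate(batches)]
```

With five accumulated records, a threshold of four, and a preset number of two, the sketch yields two requests carrying three and two records respectively.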
[0183] Further functional descriptions of the above modules and units are the same as those in the corresponding embodiments described above, and will not be repeated here.
[0184] In this embodiment, the server fault information recording device is presented in the form of functional units. Here, a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory that execute one or more software or firmware programs, and / or other devices that can provide the above functions.
[0185] This application also provides a computer device having the server fault information recording device shown in Figure 5 or Figure 6 above.
[0186] Please refer to Figure 7, which is a schematic diagram of a computer device according to an embodiment of this application. As shown in Figure 7, the computer device includes: one or more processors 10, memory 20 associated with the one or more processors 10, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components communicate with each other using different buses and can be installed on a common motherboard or otherwise as needed. The processors can process instructions executed within the computer device, including instructions stored in or on memory to display graphical information of a GUI on an external input / output device (such as a display device coupled to the interface). In some embodiments, multiple processors and / or multiple buses can be used with multiple memories and multiple memory modules, if needed. Similarly, multiple computer devices can be connected, each providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multiprocessor system). Figure 7 uses one processor 10 as an example.
[0187] Processor 10 may be a central processing unit, a network processor, or a combination thereof. Processor 10 may further include an integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
[0188] The memory 20 stores instructions executable by at least one processor 10 to cause at least one processor 10 to perform the method shown in the above embodiments.
[0189] The memory 20 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the computer device. Furthermore, the memory 20 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 20 includes memory remotely located relative to the processor 10, and these remote memories can be connected to the computer device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0190] The memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk or solid-state drive; the memory 20 may also include a combination of the above types of memory.
[0191] The computer device also includes a communication interface 30 for communicating with other devices or communication networks.
[0192] As shown in Figure 8, this application embodiment also provides a non-volatile computer-readable storage medium. The methods described above according to this application embodiment can be implemented in hardware or firmware, or implemented as computer code that can be recorded on a storage medium, or implemented as computer code downloaded via a network and originally stored on a remote storage medium or a non-transitory machine-readable storage medium and then stored on a local storage medium. Thus, the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, optical disk, read-only memory, random access memory, flash memory, hard disk, or solid-state drive, etc.; further, the storage medium can also include combinations of the above types of memory. It is understood that a computer, processor, microprocessor controller, or programmable hardware includes storage components capable of storing or receiving software or computer code. When the software or computer code is accessed and executed by the computer, processor, or hardware, the methods shown in the above embodiments are implemented.
[0193] A portion of this application can be applied as a computer program product, such as computer-readable instructions, which, when executed by one or more processors, can invoke or provide methods and / or technical solutions according to this application. Those skilled in the art will understand that the forms in which computer-readable instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, and installation package files. Correspondingly, the ways in which computer-readable instructions are executed by a computer include, but are not limited to: the processor directly executing the instructions; the processor compiling the instructions and then executing the corresponding compiled program; the processor reading and executing the instructions; or the processor reading and installing the instructions and then executing the corresponding installed program. Here, the computer-readable medium can be any available computer-readable storage medium or communication medium accessible to a computer.
[0194] Although embodiments of this application have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of this application, and such modifications and variations all fall within the scope defined by this application.
Claims
1. A server fault information recording method, applied to a baseboard management controller, the method comprising: in response to determining that a preset number of fault information recording requests have been received from a basic input / output system, creating a target task corresponding to each fault information recording request, wherein the target task is used to record the fault information corresponding to the fault information recording request; and sending a processing result corresponding to the fault information recording request to the basic input / output system, and executing the target task to complete the recording of the fault information, wherein the processing result is used to instruct the basic input / output system to control a central processing unit to exit a preset mode and continue processing business data.
2. The method of claim 1, wherein creating the target task corresponding to the fault information recording request comprises: performing data integrity verification on the fault information recording request; and in response to determining that the fault information recording request has passed the verification, creating the target task.
3. The method of claim 2, wherein performing data integrity verification on the fault information recording request comprises: obtaining structure data corresponding to the fault information recording request, and determining whether the format of the structure data is the same as a preset format, wherein the preset format is used to verify permission information of the fault information recording request; in response to determining that the format of the structure data is the same as the preset format, obtaining the parameter types and the number of parameters of the structure data; and in response to determining that the parameter types are the same as preset types and the number of parameters is the same as a preset value, determining that the fault information recording request has passed the verification.
4. The method of claim 1, wherein creating the target task corresponding to the fault information recording request comprises: creating, based on the fault information recording request, an asynchronous task inside the baseboard management controller, and using the asynchronous task as the target task; adding the target task to a task queue and obtaining the priority of the target task in the task queue; and determining the execution order of the target tasks in the task queue based on the priority.
5. The method of claim 4, wherein executing the target task comprises: determining, based on a task scheduler and the execution order, the task to be executed in the task queue; calling an information processing object to process the task to be executed to obtain processed information, and calling a log recording object to generate a fault information log based on the processed information; and saving the fault information log to a flash memory of the baseboard management controller.
6. The method of claim 5, wherein, after executing the target task, the method further comprises: obtaining an execution result of the target task; obtaining a log fragment based on the execution result, and writing the log fragment into the fault information log; and in response to determining, based on the execution result, that the execution of the target task has failed, generating an error response and returning the error response to a target object.
7. The method of claim 1, wherein the method further comprises: obtaining process information of executing the target task; and generating, based on the process information and the fault information recording request, a process log of executing the target task.
8. The method of claim 1, wherein, after executing the target task, the method further comprises: determining the occupied resources corresponding to the target task in the baseboard management controller; and releasing the occupied resources.
9. The method of claim 1, wherein the method further comprises: determining whether a preset error occurs in the baseboard management controller during the execution of the target task; and in response to determining that the preset error has occurred in the baseboard management controller, generating and outputting error information.
10. The method of claim 1, wherein creating the target task corresponding to the fault information recording request comprises: verifying whether the fault information recording request contains the necessary data and parameters; and in response to determining that the fault information recording request contains the necessary data and parameters, creating an asynchronous task within the baseboard management controller, and using the asynchronous task as the target task corresponding to the fault information recording request.
11. The method of claim 1, wherein the preset mode is a system management mode.
12. The method of claim 1, wherein performing data integrity verification on the fault information recording request comprises: obtaining a reference check value, wherein the reference check value is calculated by the basic input / output system by applying a data integrity verification algorithm to the fault information recording request; and calculating, by the baseboard management controller using the data integrity verification algorithm, a current check value of the fault information recording request, and comparing the current check value with the reference check value to perform data integrity verification on the fault information recording request.
13. A server fault information recording method, applied to a basic input / output system, the method comprising: obtaining server fault information and generating a preset number of fault information recording requests based on the fault information; sending the preset number of fault information recording requests to a baseboard management controller, wherein the fault information recording request is used to instruct the baseboard management controller to create a target task, the baseboard management controller is used to send a processing result corresponding to the fault information recording request to the basic input / output system and to execute the target task, and the target task is used to record the fault information corresponding to the fault information recording request; and in response to determining that the processing result has been received, controlling a central processing unit to exit a preset mode and continue processing business data.
14. The method of claim 13, wherein obtaining server fault information and generating a preset number of fault information recording requests based on the fault information comprises: monitoring the server to determine whether a correctable error exists in the server; in response to determining that the correctable error exists in the server, obtaining the fault information of the correctable error; determining whether the number of pieces of fault information is greater than or equal to a preset threshold; and in response to determining that the number of pieces of fault information is greater than or equal to the preset threshold, generating the preset number of fault information recording requests based on the fault information.
15. The method of claim 13, wherein creating the target task corresponding to the fault information recording request comprises: verifying whether the fault information recording request contains the necessary data and parameters; and in response to determining that the fault information recording request contains the necessary data and parameters, creating an asynchronous task within the baseboard management controller, and using the asynchronous task as the target task corresponding to the fault information recording request.
16. The method of claim 13, wherein the preset mode is a system management mode.
17. A server fault information recording apparatus, deployed on a baseboard management controller, the apparatus comprising: a task creation module configured to, in response to determining that a preset number of fault information recording requests have been received from a basic input / output system, create a target task corresponding to the fault information recording requests, wherein the target task is used to record the fault information corresponding to the fault information recording requests; and a task execution module configured to send a processing result corresponding to the fault information recording request to the basic input / output system, and to execute the target task to complete the recording of the fault information, wherein the processing result is used to instruct the basic input / output system to control a central processing unit to exit a preset mode and continue processing business data.
18. A server fault information recording apparatus, deployed in a basic input / output system, the apparatus comprising: an information acquisition module configured to acquire server fault information and generate a preset number of fault information recording requests based on the fault information; an information sending module configured to send the preset number of fault information recording requests to a baseboard management controller, wherein the fault information recording request is used to instruct the baseboard management controller to create a target task, the baseboard management controller is used to send a processing result corresponding to the fault information recording request to the basic input / output system and to execute the target task, and the target task is used to record the fault information corresponding to the fault information recording request; and a control module configured to, in response to determining that the processing result has been received, control a central processing unit to exit a preset mode and continue processing business data.
19. A computer device, comprising: one or more processors; and a memory associated with the one or more processors, the memory being used to store computer-readable instructions that, when read and executed by the one or more processors, implement the server fault information recording method as described in any one of claims 1 to 16.
20. A non-transitory computer-readable storage medium, wherein the storage medium stores computer-readable instructions that, when executed by one or more processors, implement the server fault information recording method as described in any one of claims 1 to 16.