Memory fault handling systems, methods, electronic devices and program products

By constructing a dual-process communication system in computer equipment and dynamically adjusting data acquisition parameters, the invention addresses the performance impact and energy waste caused by memory fault handling in existing technologies. This enables early warning and proactive repair of memory faults, improving the reliability and availability of the system.

CN122309216A · Pending · Publication Date: 2026-06-30 · INSPUR SUZHOU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INSPUR SUZHOU INTELLIGENT TECH CO LTD
Filing Date
2026-05-27
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing technologies, when dealing with memory failures, suffer from high-frequency data acquisition, which affects computer equipment performance, increases energy consumption, and wastes system resources. They also fail to achieve real-time prediction and fault repair, resulting in insufficient system reliability and availability.

Method used

A dual-process communication method is adopted, which uses an out-of-band management bus between the central processing unit and the management controller to exchange data. The first process and the second process are isolated from each other. The first process collects memory error information, and the second process analyzes and predicts and triggers fault repair. The data acquisition parameters are dynamically adjusted to reduce interference to the system.

Benefits of technology

It enables early warning and proactive handling of memory failures, improves the reliability and availability of computer equipment, reduces operation and maintenance costs, reduces the pressure of frequent access to out-of-band processors and out-of-band buses, reduces power consumption, and ensures real-time response capabilities.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122309216A_ABST]

Abstract

This invention discloses a memory fault handling system, method, electronic device, and program product, relating to the field of computer technology. The system includes memory, a management controller, and a central processing unit (CPU). The CPU reads system operation data from the memory via a physical link and determines memory error information. The management controller constructs two isolated processes. During the process of the first process collecting memory error information through an out-of-band interface, it listens for a trigger signal to adjust the collection parameters and adjusts the initial data collection parameters according to the signal type of the trigger signal. The second process receives the memory error information sent by the first process and triggers a fault repair operation when a memory fault is predicted based on the memory error information. This invention can solve the problems of high-frequency data collection operations affecting device performance, increasing energy consumption, and wasting resources in related technologies, enabling low-interference memory fault prediction and repair, and effectively improving memory reliability and availability.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a memory fault handling system, method, electronic device, and program product.

Background Technology

[0002] Memory is used to store data and / or instructions used by computer devices to perform computational tasks. Effective and timely handling of memory failures ensures the reliability and stability of computer devices. Related technologies collect memory-related data through high-frequency out-of-band polling and analyze the data to handle memory failures. However, this method not only affects the performance of the out-of-band processor but also increases overall system power consumption and wastes system resources.

Summary of the Invention

[0003] This invention provides a memory fault handling system, method, electronic device, and computer program product, which solves the problems of high-frequency data acquisition operations affecting the performance of computer equipment, increasing energy consumption, and wasting resources. It achieves low-interference memory fault prediction and fault repair, effectively improving the reliability and availability of memory.

[0004] To solve the above-mentioned technical problems, the present invention provides the following technical solution. This invention provides a memory fault handling system, including a memory, a management controller, and a central processing unit. The central processing unit is connected to the memory via a physical link and to the management controller via an out-of-band management bus. The memory stores at least the system's operational data. The central processing unit is configured to read the system's operational data via the physical link and to determine memory error information from the system's operational data. The management controller is configured to: construct a first process and a second process that are isolated from each other; while the first process collects memory error information through the out-of-band interface according to the initial data acquisition parameters, listen for an acquisition parameter adjustment trigger signal and adjust the initial data acquisition parameters according to the signal type of that trigger signal; and have the second process receive the memory error information sent by the first process and trigger a fault repair operation when a memory fault is predicted based on the memory error information.

[0005] This invention provides a memory fault handling method, applied to a management controller, comprising: establishing a first process and a second process that are isolated from each other; while the first process collects memory error information according to the initial data acquisition parameters, listening for an acquisition parameter adjustment trigger signal and adjusting the initial data acquisition parameters according to the signal type of that trigger signal; sending, by the first process, the memory error information to the second process, which predicts memory faults from it; and triggering, by the second process, a fault repair operation when it predicts a memory fault.

[0006] The present invention also provides an electronic device, including a memory and a processor, wherein the processor is used to implement the steps of the memory fault handling method described above when executing a computer program stored in the memory.

[0007] Finally, the present invention also provides a computer program product, including a computer program / instruction that, when executed by a processor, implements the steps of the above-described memory fault handling method.

[0008] The advantages of the technical solution provided by this invention are as follows. The central processing unit reads system operation data from memory and determines whether memory error information is present. The management controller constructs two isolated processes: a first process and a second process. The first process collects memory error information from the central processing unit out-of-band, while the second process interacts with the first process to analyze the collected memory error information and predict faults. When a memory fault is predicted, a repair operation is triggered, thereby achieving early warning and proactive handling of memory faults. This improves the overall reliability and availability of the computer equipment and reduces maintenance costs. Furthermore, the first and second processes are isolated from each other, decoupling the data acquisition function from the fault prediction function. The two processes can run, be updated, and tolerate faults independently. An anomaly in one process does not affect the normal operation of the other, further enhancing the reliability and maintainability of the entire computer equipment. The management controller listens for different types of acquisition parameter adjustment trigger signals and dynamically adjusts the currently used data acquisition parameters accordingly. This achieves a shift from fixed-period polling to adaptive acquisition, significantly reducing the access pressure on the management controller (an out-of-band processor) and the out-of-band bus, reducing unnecessary power consumption, and ensuring immediate response to critical memory error events. This dynamic and adaptive sampling method completely eliminates high-frequency interference to the central processing unit and memory-related systems during data acquisition, significantly reducing the storage and network bandwidth resource consumption of the management controller. Combining adaptive acquisition of memory fault data with the dual-process approach enables low-interference memory fault prediction and repair, is compatible with the hardware characteristics of different computer devices, and has better practicality. Furthermore, this invention also provides a corresponding method, electronic device, and computer program product for the memory fault handling system, which offer corresponding advantages.

Attached Figure Description

[0009] To more clearly illustrate the technical solutions of the present invention or related technologies, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0010] Figure 1 is a schematic diagram of an exemplary embodiment of the memory fault handling system provided by the present invention;
Figure 2 is a flowchart of an adaptive data acquisition method provided by the present invention;
Figure 3 is a schematic diagram of the error handling process under different hardware architectures provided by the present invention;
Figure 4 is a flowchart of the memory fault handling system provided by the present invention in a first exemplary hardware architecture application scenario;
Figure 5 is a flowchart of the memory fault handling system provided by the present invention in a second exemplary hardware architecture application scenario;
Figure 6 is a flowchart of a memory fault handling method provided by the present invention;
Figure 7 is a structural framework diagram of an exemplary embodiment of the memory fault handling device provided by the present invention.

Detailed Implementation

[0011] To enable those skilled in the art to better understand the technical solutions of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. In this specification and the aforementioned drawings, the terms "first," "second," "third," "fourth," etc., are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.

[0012] Memory fault handling mechanisms are predictive, diagnostic, and repair mechanisms established by computer devices such as servers for the memory component. They constitute a proactive defense system to improve the reliability of computer equipment. Traditional memory fault handling relies on hardware-level error detection and correction mechanisms, such as continuously monitoring memory health through memory controllers and baseboard management controllers, and passively responding based on ECC (Error Checking and Correcting) counting thresholds (which can be pre-defined by administrators or other users based on actual operational needs or prior knowledge). Related technologies achieve basic monitoring through out-of-band data acquisition, but they have drawbacks, including: using fixed-period acquisition methods (such as 1 second or 2 seconds) leads to out-of-band bus congestion; out-of-band buses such as PECI (Platform Environment Control Interface) or ACPI (Advanced Configuration and Power Interface) increase CPU (Central Processing Unit) power consumption and system jitter; in addition, high-frequency polling can waste storage bandwidth and pose potential security risks; and at the same time, it is impossible to dynamically adjust the acquisition frequency according to system load, affecting real-time performance and energy efficiency.

[0013] In computer devices based on the AMD architecture, the processor integrates a hardware debugging engine called ADDC (Autonomous Debug Data Collection), which is responsible for collecting error data out-of-band. The ADDC Out-of-Band (OOB) access path interfaces with the BMC (Baseboard Management Controller) via ACPI, completely bypassing the CPU (Central Processing Unit), OS (Operating System), and system firmware. Even when the processor is unstable or hung, for example after clock loss, an x86 pipeline stall, or an unresponsive SMU (System Management Unit), the BMC can initiate the transaction and still read UCE (Uncorrectable Error) and CE (Correctable Error) information from the registers. This method supports a first-fault acquisition strategy: before a system reset or restart, the BMC triggers the ADDC logic, and the full debug data, covering the MCA (Machine Check Architecture) group, EX-WDT (Execution Unit Watchdog Timer), PCIe (Peripheral Component Interconnect Express), DF (Data Fabric), NBIO (North Bridge Input / Output), CXL (Compute Express Link), FCH (Fusion Controller Hub), DRAMC (Dynamic Random Access Memory Controller), ECC counters, and others, is saved to the BMC's NVMe (Non-Volatile Memory Express) storage or a designated log area to achieve zero-day closed-loop debugging. The BMC can also periodically poll the ADDC registers to discover the cumulative trend of correctable errors without waiting for fatal interrupts, which supports continuous runtime monitoring.

[0014] In summary, computer devices based on the AMD architecture can achieve out-of-band monitoring of memory CEs by configuring ADDC parameters. For example, periodic polling of DRAM (Dynamic Random Access Memory) correctable ECC errors can be enabled or disabled through the parameter DRAM Cecc Polling En (DRAM Correctable Error Checking and Correcting Polling Enable), and the data acquisition frequency can be controlled by DRAM Cecc Polling Period (the time interval between polling operations). For example, if DRAM Cecc Polling En is set to true and DRAM Cecc Polling Period is set to 5, a memory fault scan is performed every 5 seconds.
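
The effect of the two parameters can be illustrated with a minimal Python sketch. The class and function names below are hypothetical and are not part of any AMD interface; the sketch also assumes the polling period is expressed in seconds, as in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CeccPollingConfig:
    """Hypothetical mirror of the two parameters named in the text."""
    polling_enabled: bool   # stands in for "DRAM Cecc Polling En"
    polling_period_s: int   # stands in for "DRAM Cecc Polling Period" (assumed seconds)


def scan_times(cfg: CeccPollingConfig, horizon_s: int) -> list[int]:
    """Return the offsets (seconds from start) at which a scan would run."""
    if not cfg.polling_enabled or cfg.polling_period_s <= 0:
        return []  # polling disabled: no periodic scans at all
    return list(range(cfg.polling_period_s, horizon_s + 1, cfg.polling_period_s))


cfg = CeccPollingConfig(polling_enabled=True, polling_period_s=5)
print(scan_times(cfg, 20))  # [5, 10, 15, 20]
```

With a fixed period, the scan schedule is independent of load or error activity, which is the behavior the later paragraphs replace with adaptive acquisition.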

[0015] However, these computer devices provide only basic ECC error correction and a reactive Page Retirement (PRL) mechanism. Only after correctable errors accumulate to a BIOS (Basic Input Output System) threshold (a value pre-defined by administrators or other users based on actual operational needs or prior knowledge) can they passively mark faulty pages, and only at the next restart. They cannot assess at runtime which physical pages are about to fail and pre-emptively soft-offline them, so they cannot provide online predictive isolation. This leads to a high probability of exposure to uncorrectable errors (UE), increasing the risk of sudden system crashes or data corruption. Furthermore, sudden fluctuations in available memory capacity affect business continuity. In addition, no machine learning prediction model, health scoring algorithm, or large-scale deployment data equivalent to the MRT (Memory Resilience Technology) of Intel architecture computer devices has been disclosed for these devices. The ecosystem cannot directly call standardized interfaces to achieve closed-loop protection before and after a fault event, forcing server manufacturers to develop their own systems, which results in high R&D costs and fragmented implementations across manufacturers. This approach also suffers from the same problem as Intel-based computer devices, namely that frequent acquisition degrades out-of-band processor performance, which will not be elaborated here. Even though APML (Advanced Platform Management Link) on AMD platforms uses I3C (Improved Inter Integrated Circuit) as its underlying transmission protocol, which has a significant speed advantage over PECI, out-of-band CE acquisition still faces the same problems as MRT out-of-band acquisition.

[0016] In view of this, in order to solve the problems existing in the memory fault handling process of computer equipment, this invention can implement Intel MRT-like functions on the existing ADDC out-of-band autonomous acquisition framework of the AMD platform by adopting a dual-process communication method. It can completely eliminate high-frequency interference to CPU and memory-related systems with a dynamic and adaptive sampling method, and significantly reduce the storage resources and network bandwidth resources on the BMC side. This achieves both the reduction of R&D costs and the reduction of the impact of frequent data acquisition on out-of-band processor performance.

[0017] Various non-limiting embodiments of the present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. Referring to Figure 1, the memory fault handling system provided by this invention, when handling memory faults in computer devices, may include the following. The memory fault handling system includes at least memory 101, a management controller 102, and a central processing unit (CPU) 103. The CPU 103 is connected to the memory 101 via a physical link. For example, the physical link could be the PCB (Printed Circuit Board) traces and pins of the motherboard. Any physical link between the CPU and memory of any computer device can be used, and this invention does not limit this. The CPU 103 is connected to the management controller 102 via an out-of-band management bus. The out-of-band management bus refers to the actual physical link or protocol link that supports interaction and signal / data transmission between the management controller 102 and the CPU 103. For example, when the management controller 102 and the CPU 103 communicate via APML (Advanced Platform Management Link), they are connected via a low-speed serial bus on the motherboard, such as the System Management Bus (SMBus) or the I3C (Improved Inter-Integrated Circuit) bus.

[0018] In this embodiment, the memory 101 stores at least system operating data. The central processing unit 103 is configured to read the system operating data through a physical link and determine memory error information from the system operating data. The management controller 102 is configured to: construct a first process and a second process that are isolated from each other; while the first process collects memory error information through an out-of-band interface according to the initial data acquisition parameters, listen for an acquisition parameter adjustment trigger signal and adjust the initial data acquisition parameters according to the signal type of that trigger signal; and have the second process receive the memory error information sent by the first process and trigger a fault repair operation when a memory fault is predicted based on the memory error information.

[0019] The system operation data refers to the data generated during normal operation by the computer device in which the memory fault handling system is located. The first process is an independently running program that collects the memory error information stored in the registers of the central processing unit 103. It can be implemented directly through the out-of-band data acquisition function built into the computer device (such as AMD's ADDC or Intel's RASoffload), without developing additional core acquisition logic, thereby reducing costs. Of course, the data acquisition function can also be built independently according to actual needs, which does not affect the implementation of this invention. The second process is a running program independent of the first process. It receives the collected memory error information and predicts whether there is a risk of memory fault by analyzing error patterns, frequencies, and other characteristics. A communication channel for transmitting data and exchanging signals, such as a data pipe, can be pre-configured between the first and second processes. The data pipe (for example, an inter-process communication mechanism based on message queues or shared memory) provides secure and efficient data transmission between the two processes and keeps their information synchronized under process isolation. The first process and the second process constructed by the management controller 102 are thus two logically isolated software execution entities, and this isolation ensures that they can be operated and maintained independently. In practical applications, the second process is usually self-developed, while the first process can be provided by the vendor of the computer device's processor architecture, such as AMD or Intel. If the code corresponding to the first process has a common interface or switch, and the second process cannot be executed when that switch is turned off, it can be determined that the two have a dependency relationship, that is, the two are not isolated.
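
The two-process arrangement can be sketched in Python as follows. This is an illustration only, not the patented implementation: the collector runs in the parent process, the predictor runs as a separate operating system process, and the two exchange JSON lines over a pipe. The record fields (`dimm`, `ce_count`) and the threshold of 100 are assumptions made for the example.

```python
import json
import subprocess
import sys

# Source of the hypothetical "second process": it reads one JSON record per
# line from its stdin pipe and answers with a one-line verdict on stdout.
PREDICTOR_SRC = r'''
import json, sys
for line in sys.stdin:
    record = json.loads(line)
    risky = record["ce_count"] >= 100   # placeholder rule, not from the source
    print(json.dumps({"dimm": record["dimm"], "fault_predicted": risky}), flush=True)
'''


def run_collector(records):
    """First process: forward collected records to the predictor over a pipe."""
    verdicts = []
    with subprocess.Popen(
        [sys.executable, "-c", PREDICTOR_SRC],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    ) as predictor:
        for record in records:
            predictor.stdin.write(json.dumps(record) + "\n")
            predictor.stdin.flush()
            verdicts.append(json.loads(predictor.stdout.readline()))
    # Leaving the "with" block closes the pipes and waits for the child to exit.
    return verdicts


if __name__ == "__main__":
    sample = [{"dimm": "A0", "ce_count": 3}, {"dimm": "B1", "ce_count": 250}]
    for verdict in run_collector(sample):
        print(verdict)
```

Because the predictor is a separate process with its own interpreter, it can crash or be replaced without taking the collector down, which is the decoupling property the paragraph describes.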

[0020] The initial data acquisition parameters are the acquisition parameters used by the first process in the current data acquisition flow or in the current data acquisition round. At system startup, the initial data acquisition parameters are the default acquisition configuration information loaded from a specified location. The initial data acquisition parameters are a set of acquisition parameters, including but not limited to acquisition mode, acquisition period, data acquisition frequency, and event level. If different data sources use different trigger acquisition periods, data acquisition frequencies, or levels, corresponding data source identifiers can be set for different data sources, and corresponding acquisition parameters can be set for that data source. In this scenario, the initial data acquisition parameters are a collection of multiple sets of acquisition parameters, with one set of acquisition parameters corresponding to one data source or one type of data source. The acquisition parameter adjustment trigger signal is a signal that triggers the adjustment of at least one acquisition parameter in the initial data acquisition parameters. Different signal types can be set to adjust different types of acquisition parameters. Memory error information is data related to correctable and uncorrectable errors generated during memory operation, including error type, occurrence location, frequency, etc. Memory error information is stored in the registers of the memory controller. The management controller's monitoring is a low-power, cyclical process. In this process, the first process is responsible for executing data acquisition tasks. Upon initialization, this process begins acquiring memory error information, such as correctable error counts, from the CPU's memory controller according to a set of initial data acquisition parameters (e.g., a default acquisition period of 5 seconds). 
During acquisition, the management controller listens for acquisition parameter adjustment trigger signals. These signals are the source of instructions for dynamically adjusting acquisition behavior. The currently used acquisition parameters are adjusted according to the signal type, for example whether the signal indicates an urgent hardware event, the expiry of a timer period, or a change in system load. For example, when a high-priority event signal is received, an acquisition may be performed immediately; when the system load is too high, the acquisition interval is automatically extended. After acquiring memory error information, the first process can send it to the second process via the data pipe. The second process can have a built-in fault analysis program, for example one that performs fault analysis against thresholds pre-set from prior knowledge, or it can call a fault analysis model, such as a machine learning model with fault analysis capabilities, through a specified interface. By analyzing the memory error information, it can determine memory health trends and predict whether a memory fault is likely, for example whether there is a risk of impending uncorrectable errors. Fault repair operations are remedial actions performed in response to predicted memory faults, such as memory page isolation and runtime post-package repair. When the second process predicts a memory fault, it triggers the corresponding fault repair operation. For example, it sends a command to the system firmware requesting the isolation of high-risk memory pages, so that the operating system can remove them from the available memory pool before the hardware actually fails, thus preventing data corruption or system crashes.
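
As an illustration of the per-data-source parameter sets and of a threshold-based analysis in the second process, consider the Python sketch below. All names, the data source identifiers, and the rate limit are assumptions chosen for the example; the source does not specify a concrete prediction rule.

```python
from dataclasses import dataclass


@dataclass
class AcquisitionParams:
    mode: str            # acquisition mode, e.g. "polling" or "event"
    period_s: float      # acquisition period in seconds
    frequency_hz: float  # acquisitions per second
    event_level: int     # event level handled by this data source


# One parameter set per data source identifier (hypothetical identifiers).
initial_params = {
    "dram_ce_counter": AcquisitionParams("polling", 5.0, 0.2, 1),
    "mca_registers": AcquisitionParams("event", 10.0, 0.1, 2),
}


def predict_fault(ce_counts, period_s, max_rate_per_s):
    """Return True when the CE counter grows faster than the allowed rate."""
    if len(ce_counts) < 2:
        return False  # not enough samples to estimate a trend
    elapsed_s = period_s * (len(ce_counts) - 1)
    rate = (ce_counts[-1] - ce_counts[0]) / elapsed_s
    return rate > max_rate_per_s


def handle_samples(source, ce_counts, page, max_rate_per_s=1.0):
    """Second-process step: analyze samples and request a repair if needed."""
    params = initial_params[source]
    if predict_fault(ce_counts, params.period_s, max_rate_per_s):
        return {"action": "isolate_page", "page": page}
    return None


print(handle_samples("dram_ce_counter", [0, 2, 40], page=0x1234))
print(handle_samples("dram_ce_counter", [0, 1, 2], page=0x1234))
```

A machine learning model could replace `predict_fault` behind the same interface; the rate rule here only stands in for the "pre-set thresholds" option mentioned in the text.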

[0021] In the technical solution provided in this embodiment, the central processing unit reads system operation data from memory and determines whether memory error information exists. The management controller constructs two isolated processes: a first process and a second process. The first process collects memory error information from the central processing unit out-of-band, while the second process interacts with the first process to analyze the collected memory error information and predict faults. When a memory fault is predicted, a repair operation is triggered, thereby achieving early warning and proactive handling of memory faults. This improves the overall reliability and availability of the computer equipment and reduces maintenance costs. Furthermore, the first and second processes are isolated from each other, decoupling the data acquisition function from the fault prediction function. The two processes can run, be updated, and tolerate faults independently. An anomaly in one process does not affect the normal operation of the other, further enhancing the reliability and maintainability of the entire computer equipment. The management controller listens for different types of acquisition parameter adjustment trigger signals and dynamically adjusts the currently used data acquisition parameters accordingly. This achieves a shift from fixed-period polling to adaptive acquisition, significantly reducing the frequent access pressure on the management controller (an out-of-band processor) and the out-of-band bus, reducing unnecessary power consumption, and ensuring immediate response to critical memory error events. 
This dynamic and adaptive sampling method completely eliminates high-frequency interference to the central processing unit and memory-related systems during the data acquisition process, significantly reduces the storage and network bandwidth resource consumption of the management controller, and achieves low-interference memory fault prediction and repair by adaptively acquiring memory fault data combined with a dual-process processing method. It is compatible with the hardware characteristics of different computer devices and has better practicality.

[0022] Based on the above embodiments, when implementing adaptive data acquisition, the management controller 102 needs to clearly define the response behavior corresponding to different types of signals to ensure the rationality and real-time nature of the acquisition parameter adjustment. This avoids the phenomenon of untimely response when critical events occur or excessive resource consumption during acquisition operations when the system is under high load, which affects the overall operating efficiency. Accordingly, an exemplary implementation method for the management controller 102 to adjust the initial data acquisition parameters may include the following: The management controller 102 is also configured to: upon detecting an event-driven signal via the out-of-band management bus, interrupt the current acquisition task and perform at least one data acquisition operation on the corresponding hardware at a first preset time. Upon detecting a time-driven signal, update the data acquisition period in the initial data acquisition parameters according to the time-driven signal to adjust the scheduling time of the next acquisition operation. Upon detecting a resource status signal, update the data acquisition frequency in the initial data acquisition parameters according to the resource status signal.

[0023] The difference between the first preset time and the time at which the event-driven signal is detected is the preset minimum response time, such as 0.1 s. The preset minimum response time represents the fastest response speed that the current scenario can support. Controlling the first preset time ensures that the data corresponding to the event-driven signal is collected at the fastest speed the current scenario allows. Event-driven signals are signals triggered by external or internal events, such as memory error interrupts and hardware alarms; they call for an immediate response and have the highest response priority. Time-driven signals are signals generated on fixed or adjustable periods, such as timer interrupts, and are used for regular polling. Resource status signals are indicators that reflect the status of system resources, such as CPU utilization, temperature, or bandwidth changes, and are used to adjust background settings. In practical applications, the different types of response behavior (such as immediate response and periodic adjustment) can be verified through log analysis or performance monitoring tools, for example by checking whether an acquisition runs immediately when an event interrupt occurs, or whether the acquisition frequency is adjusted as the load changes.

[0024] In this embodiment, when the management controller 102 detects an event-driven signal, an event requiring immediate attention has occurred. The management controller interrupts any ongoing general data acquisition task and immediately performs a data acquisition operation on the specific hardware associated with the event (such as the memory channel that triggered the interrupt) to capture transient error information. When the detected signal is a time-driven signal, it typically indicates that the preset polling period of a data source has expired. Based on the information carried by this signal, the data acquisition period in the initial data acquisition parameters is updated. For example, the acquisition period of a counter is adjusted from 10 seconds to 15 seconds, thereby changing the planned time of the next data acquisition operation on that data source. When a resource status signal, such as a system load report, is detected, the overall operating status of the system has changed. Based on this signal, the data acquisition frequency (number of acquisitions per unit time) in the initial data acquisition parameters is updated. For example, the overall acquisition frequency is reduced when the system is under high load to reduce competition for system resources.
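
The three response behaviors can be summarized in a short Python sketch. It is a simplified model under stated assumptions: the names are hypothetical, the 0.1 s minimum response time is taken from the example in [0023], and the rule of halving the frequency above 80% load is invented purely for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional, Tuple


class SignalType(Enum):
    EVENT = "event"        # e.g. memory error interrupt, hardware alarm
    TIME = "time"          # e.g. timer expiry carrying a new period
    RESOURCE = "resource"  # e.g. system load report

MIN_RESPONSE_S = 0.1  # preset minimum response time (example value from [0023])


@dataclass(frozen=True)
class Params:
    period_s: float      # data acquisition period
    frequency_hz: float  # acquisitions per unit time


@dataclass(frozen=True)
class Signal:
    kind: SignalType
    detected_at: float   # time the signal was detected, in seconds
    value: float = 0.0   # new period for TIME, load fraction (0..1) for RESOURCE


def on_signal(params: Params, signal: Signal) -> Tuple[Params, Optional[float]]:
    """Return the adjusted parameters and, for events, the immediate acquisition time."""
    if signal.kind is SignalType.EVENT:
        # Interrupt the current task and acquire at the first preset time.
        return params, signal.detected_at + MIN_RESPONSE_S
    if signal.kind is SignalType.TIME:
        # Update the period, which moves the next scheduled acquisition.
        return replace(params, period_s=signal.value), None
    if signal.kind is SignalType.RESOURCE:
        # Assumed policy: halve the acquisition frequency above 80% load.
        factor = 0.5 if signal.value > 0.8 else 1.0
        return replace(params, frequency_hz=params.frequency_hz * factor), None
    raise ValueError(f"unknown signal type: {signal.kind}")


base = Params(period_s=10.0, frequency_hz=0.1)
print(on_signal(base, Signal(SignalType.TIME, detected_at=0.0, value=15.0)))
```

Keeping the parameters immutable and returning a new `Params` object makes each adjustment easy to log, which fits the log-based verification mentioned in [0023].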

[0025] As can be seen from the above, this embodiment clearly classifies signal types into event-driven, time-driven, and resource status-driven methods to achieve multi-dimensional dynamic adjustment of acquisition parameters. This makes the adaptive acquisition logic clearer, more efficient, and more predictable. It can adopt the most suitable data acquisition method for different scenarios (such as emergency faults, regular inspections, and system pressure), avoid data acquisition operations competing for resources with business operations, optimize resource utilization while ensuring real-time monitoring, and ensure the overall stability of system operation.
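The three-way classification described above can be illustrated with a minimal sketch. It is not the implementation of this embodiment: the names `SignalType`, `AcquisitionParams`, and `handle_signal`, as well as the payload keys, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class SignalType(Enum):
    EVENT = auto()     # e.g. memory error interrupt, hardware alarm
    TIME = auto()      # e.g. polling period of one data source changed
    RESOURCE = auto()  # e.g. system load report


@dataclass
class AcquisitionParams:
    # Hypothetical containers: data source name -> value.
    period_s: dict = field(default_factory=dict)      # polling period in seconds
    frequency_hz: dict = field(default_factory=dict)  # acquisitions per second


def handle_signal(params, sig_type, payload):
    """Apply one input signal and return a short description of the action."""
    if sig_type is SignalType.EVENT:
        # Highest priority: acquire from the hardware that raised the event.
        return f"immediate acquisition on {payload['target']}"
    if sig_type is SignalType.TIME:
        # Update the acquisition period of the named data source only.
        params.period_s[payload["source"]] = payload["new_period_s"]
        return f"period of {payload['source']} set to {payload['new_period_s']} s"
    if sig_type is SignalType.RESOURCE:
        # Scale the acquisition frequency of every data source.
        for src in params.frequency_hz:
            params.frequency_hz[src] *= payload["scale"]
        return f"all frequencies scaled by {payload['scale']}"
    raise ValueError(f"unknown signal type: {sig_type}")


params = AcquisitionParams(period_s={"ce_counter": 10.0},
                           frequency_hz={"ce_counter": 0.1})
print(handle_signal(params, SignalType.TIME,
                    {"source": "ce_counter", "new_period_s": 15.0}))
```

The time-driven branch mirrors the example of a counter whose period moves from 10 seconds to 15 seconds; the scale factor in the resource branch is assumed to be supplied by the caller.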

[0026] For example, the central processing unit 103 includes at least a memory controller. The memory controller is connected to the memory slots via pins and is configured to: read system operating data from memory, perform error detection and correction processing on the system operating data, and update the detected memory error information to the corresponding registers. The memory controller is integrated inside the central processing unit 103. For example, the memory controller can be connected to the memory slots on the motherboard via pins on the bottom of the central processing unit 103 package. Each memory channel can correspond to a set of independent physical lines, such as data lines, address lines, control lines, and clock lines. If a multi-channel memory architecture is supported, each memory channel can connect 1-2 DIMMs (Dual In-line Memory Modules). Memory error information refers to the information recorded by the memory controller when a correctable or uncorrectable error occurs. This information is updated in the internal registers and records details such as the physical address, row / column position, and error type of the error, which can be collected and analyzed out-of-band by the management controller. The memory controller communicates directly with memory chips, such as dynamic random access memory, via a parallel bus. The link topology is determined by the motherboard wiring design. Taking the AMD platform as an example, the data read process between the CPU 103's memory controller and physical memory (such as dynamic random access memory) can be as follows: When the CPU 103 needs data, it sends a memory read transaction request containing the physical address to the memory controller. The address can be decoded into DIMM, Rank (an independently addressable logical unit), Bank (a constituent unit into which a Rank is divided), and row / column, and a read command is sent to the memory chip via the physical link. The memory chip transmits the data of the corresponding storage unit back to the memory controller via the data lines. After receiving the data, if ECC is enabled, the memory controller immediately performs error detection and correction. If an error occurs, it updates its internal registers, recording the error address, type, count, and other information, and then returns the data to the CPU 103 via the on-chip network. The BMC can indirectly read the memory controller's error registers (such as the CE counter) via APML without interrupting the CPU's operation, achieving out-of-band data acquisition.

[0027] Considering that a delayed response to hardware anomalies may lead to missed critical fault information, resulting in the failure to detect and handle faults in a timely manner and increasing the risk of system downtime, this embodiment also provides an implementation method to ensure a fast and accurate response to urgent hardware errors in an event-driven scenario, which may include the following: The central processing unit 103 is also configured to: generate a hardware event signal conforming to the out-of-band management bus transmission format according to the hardware trigger signal when a hardware trigger signal is generated, and send the hardware event signal to the management controller 102 through the out-of-band management bus; the management controller 102 is also configured to: receive the hardware event signal through the out-of-band interface, generate a hardware event interrupt signal, and when the hardware event interrupt signal is detected, pause the current acquisition task and immediately acquire the memory controller register data of the target hardware corresponding to the hardware event interrupt signal.

[0028] The hardware trigger signal is automatically triggered by the central processing unit 103 when a hardware error is detected. For example, when the memory controller detects a correctable error while reading data, it automatically updates its corresponding register, recording information such as the error address, type, and count. When the register value meets a preset threshold condition, the comparator circuit immediately generates an internal hardware trigger signal. The preset threshold condition is one or more trigger threshold values predefined in the register corresponding to the correctable error in the memory controller, based on prior knowledge or the actual scenario. This threshold value can be the cumulative number of correctable errors within a unit time window or the number of consecutive correctable errors occurring at the same physical memory address. When the real-time count value of the register is equal to or exceeds the preset threshold value, the hardware comparator circuit determines that the condition is met and automatically outputs the hardware trigger signal. To send this hardware trigger signal to the management controller, it needs to be packaged and defined as a hardware event signal. For example, through APML communication, an event notification message packet conforming to the APML protocol format can be generated and sent to the management controller 102 via APML. Upon receiving the hardware event signal, the management controller 102 generates a local interrupt signal, also known as a hardware event interrupt signal, to notify that a high-priority event has arrived. The target hardware is the hardware that triggers the hardware event interrupt signal, such as the hardware corresponding to the memory module that caused the error. The memory controller register data is the error-related data stored in the memory controller register by the target hardware, including the error type, physical address, and occurrence time. The out-of-band interface is the interface corresponding to the out-of-band management bus; for example, if the out-of-band management bus is APML, then the out-of-band interface corresponds to the APML interface.
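As a software analogy of the threshold condition described above (in this embodiment the comparison is done by a hardware comparator circuit, not by software), the following minimal sketch checks both example conditions: the cumulative number of correctable errors within a time window, and the number of consecutive correctable errors at the same physical address. The class name `CeThresholdMonitor` and all default values are hypothetical.

```python
from collections import deque


class CeThresholdMonitor:
    """Software model of the two example threshold conditions (hypothetical)."""

    def __init__(self, window_s=60.0, max_in_window=10, max_consecutive_same_addr=3):
        self.window_s = window_s
        self.max_in_window = max_in_window
        self.max_consecutive = max_consecutive_same_addr
        self._times = deque()   # timestamps of errors still inside the window
        self._last_addr = None  # address of the most recent error
        self._streak = 0        # consecutive errors seen at _last_addr

    def record(self, timestamp_s, phys_addr):
        """Record one correctable error; return True if a trigger should fire."""
        self._times.append(timestamp_s)
        # Drop errors that have fallen out of the time window.
        while self._times and timestamp_s - self._times[0] > self.window_s:
            self._times.popleft()
        # Track consecutive errors at the same physical address.
        if phys_addr == self._last_addr:
            self._streak += 1
        else:
            self._last_addr, self._streak = phys_addr, 1
        # "Equal to or exceeds" the threshold, as stated in the text.
        return (len(self._times) >= self.max_in_window
                or self._streak >= self.max_consecutive)


monitor = CeThresholdMonitor(window_s=60.0, max_in_window=3)
fired = [monitor.record(t, addr) for t, addr in [(0.0, 0x1000), (10.0, 0x2000), (20.0, 0x3000)]]
print(fired)
```

With a window threshold of 3, the third error inside the 60-second window is the one that satisfies the condition.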

[0029] In this embodiment, while the first process is executing the data acquisition task, the management controller continuously monitors for input hardware event signals to maintain a highly sensitive response to hardware anomalies. When a hardware event signal is received via the out-of-band interface, the management controller generates a hardware event interrupt signal, immediately suspends all currently running non-critical data acquisition tasks, releases relevant resources, and prioritizes responding to the interrupt signal. After identifying the target hardware corresponding to the interrupt signal, the first process can immediately initiate a read operation on the memory controller register of that target hardware to acquire the complete error data stored in the register, ensuring that no critical fault information is missed. After acquisition is completed, the first process quickly sends the batch of critical error data to the second process through a data pipeline, while simultaneously resuming the suspended regular data acquisition tasks or adjusting subsequent acquisition parameters according to the event situation.

[0030] As can be seen from the above, this embodiment responds to hardware anomaly events in real time, ensures the complete collection of critical fault information, helps to shorten the time from the occurrence of an error to the discovery of the fault, improves the system's ability to defend against sudden memory failures, and reduces the risk of system crashes caused by the expansion of the fault.

[0031] For example, in a computer device using an AMD architecture CPU, the CPU 103 has an internal ADDC as a hardware debugger. This hardware debugger is integrated into a dedicated hardware logic block (hard core) within the CPU 103, as part of the SMU (System Management Unit) or MPRAS (Microprocessor for RAS), and is isolated from and operates independently of the CPU 103. The CPU has one or more dedicated error reporting buses (such as RAS buses) that broadcast error information when the memory controller or other units detect errors. The memory controller's registers may reside in the memory controller's address space, which is mapped to the CPU's unified address space and accessible through the processor's internal interconnect structure. The hardware debugger, as a device within the CPU, can directly read the values of these registers through this internal interconnect; this access does not involve the CPU core and is independent of the operating system. The hardware debugger contains hardware comparators and threshold registers. These circuits continuously compare the values read from the memory controller's registers (such as CE counts) with preset thresholds customized based on prior knowledge or actual application requirements. The comparison operation is performed by dedicated circuitry without software intervention. Once the threshold is exceeded, the comparator immediately triggers a hardware event. Correspondingly, the hardware debugger is connected to the memory controller via the internal interconnect of the central processing unit 103 or the error reporting bus. The hardware debugger is further configured to: when it detects an error signal sent by the memory controller via the error reporting bus, read the register data of the memory controller, or actively read the register data of the memory controller via the internal interconnect. The threshold register of the hardware debugger is configured to: if the detected memory error information in the register data meets a preset threshold condition, generate a hardware trigger signal, generate a hardware event signal conforming to the out-of-band management bus transmission format based on the hardware trigger signal, and send the hardware event signal to the management controller 102 via the out-of-band management bus.

[0032] In this embodiment, the hardware debugger continuously monitors the registers of the memory controller and determines whether preset threshold conditions are met through an internally configured threshold register. The preset threshold conditions are one or more trigger threshold values predefined based on prior knowledge or the actual scenario, located in the registers corresponding to correctable errors on the memory controller. These threshold values can be the cumulative number of correctable errors within a unit time window or the number of consecutive correctable errors occurring at the same physical memory address. For example, the condition can be set to a CE count greater than 10 within one minute. This comparison is performed in real time by a dedicated comparator circuit. When the register value meets the preset threshold conditions, the comparator circuit immediately generates an internal hardware trigger signal. This signal is not delivered to the CPU 103 as an interrupt but goes directly to the ADDC's own control logic. Upon receiving the trigger signal, the ADDC control logic locks the data for several clock cycles to prevent the contents of the relevant error register group from being overwritten by new errors. The contents of a complete set of predefined debug registers are then read out via the internal bus. This data set includes not only the CE register that triggered the event, but potentially also related address registers, timestamps, processor context states, etc., forming a complete error snapshot. This snapshot data is written to a dedicated buffer in the ADDC, which the BMC can access via APML. After data storage is complete, the ADDC proactively generates an event notification message packet conforming to the APML protocol format through its integrated APML engine and sends it to the BMC via APML.

[0033] If the data acquisition cycle cannot be flexibly adjusted according to actual needs, data sources with high failure rates may be sampled too rarely while stable data sources are sampled too often, leading to unreasonable resource allocation. This embodiment therefore also provides an implementation method for adjusting the data acquisition cycle based on a time-driven signal, which may include the following: The management controller 102 integrates at least one timer, such as a watchdog timer or a general-purpose timer. The timer is configured to send, via the internal bus or interrupt controller of the management controller 102, a timing signal indicating that the polling period value of the target data source has been modified. The management controller 102 is also configured to adjust the initial data acquisition parameters according to the new timing value, and the first process determines the scheduling time of the next acquisition task for the target data source based on the new timing value.

[0034] The target data source is a specific data source whose collection cycle needs to be adjusted, such as an error information collection channel for a memory module. The polling cycle value is the time interval between two consecutive collections from the target data source. The scheduling time is the planned time point for the next collection task from the target data source.

[0035] In this embodiment, the management controller 102 integrates timers such as watchdog timers and general-purpose timers, which are driven by the clock of the management controller 102 and run independently. When the timer count reaches a preset value, a timer interrupt signal can be generated directly to the processor core of the management controller 102 via its internal bus or interrupt controller. After loading the initial data acquisition parameters, the management controller 102 sets a corresponding polling timer for each target data source. During the data acquisition task executed by the first process according to the initial cycle value, the management controller 102 continuously monitors for the presence of time-driven signals. When it receives a signal indicating that the polling cycle value of a target data source has been modified, it extracts the new timing value from the signal, updates the polling timer value corresponding to the target data source, and reschedules the next acquisition task according to the new timing value to ensure that the acquisition cycle meets the adjustment requirements. After the new scheduling time is reached, the first process performs the acquisition operation for the target data source, and continues to cycle according to the updated cycle after the acquisition is completed.
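The rescheduling behavior described above can be sketched with a small priority-queue scheduler. This is only an illustration under assumptions: the class `PollScheduler`, its method names, and the use of a version counter to discard superseded schedule entries are hypothetical and not taken from this embodiment.

```python
import heapq


class PollScheduler:
    """Per-data-source polling timers with support for changing the period."""

    def __init__(self):
        self._period = {}   # source -> current polling period in seconds
        self._version = {}  # source -> counter used to invalidate stale entries
        self._heap = []     # entries of (due_time, source, version)

    def set_period(self, source, period_s, now):
        """Set or modify a source's period and reschedule its next acquisition."""
        if period_s <= 0:
            raise ValueError("period must be positive")
        self._period[source] = period_s
        self._version[source] = self._version.get(source, 0) + 1
        heapq.heappush(self._heap, (now + period_s, source, self._version[source]))

    def pop_due(self, now):
        """Return the sources due at `now` and re-arm each with its current period."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, source, version = heapq.heappop(self._heap)
            if version != self._version[source]:
                continue  # entry was superseded by a later set_period call
            due.append(source)
            heapq.heappush(self._heap, (now + self._period[source], source, version))
        return due


sched = PollScheduler()
sched.set_period("dimm0_ce", 10.0, now=0.0)
first = sched.pop_due(10.0)                   # acquisition at t = 10 s
sched.set_period("dimm0_ce", 15.0, now=10.0)  # period modified to 15 s
print(first, sched.pop_due(20.0), sched.pop_due(25.0))
```

After the period is modified at t = 10 s, the entry scheduled under the old period is ignored and the next acquisition happens at t = 25 s.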

[0036] As can be seen from the above, this embodiment supports flexible adjustment of the data source acquisition cycle, thereby enabling optimization of the acquisition cycle based on factors such as the stability and failure risk of the data source. For data sources with high failure rates, the cycle can be shortened to increase monitoring density, while for stable data sources, the cycle can be extended to reduce resource consumption, thus achieving reasonable resource allocation and improving acquisition efficiency.

[0037] Considering that data acquisition operations consume significant resources and cause service disruptions when the system is under high load, and that insufficient data acquisition frequency may affect fault monitoring effectiveness when the system is under low load, this embodiment also provides a method for adjusting the data acquisition frequency based on system resource load, which may include the following: The performance monitoring unit of the central processing unit 103 is configured to send load status report data to the management controller 102 via the out-of-band management bus; the management controller 102 is also configured to: when receiving load status report data through the out-of-band interface, determine the new data acquisition frequency value according to the load-frequency mapping relationship, and update the initial data acquisition parameters accordingly.

[0038] The load status report data reflects system resource usage, including metrics such as CPU utilization, memory usage, and bus load. The load-frequency mapping relationship is a pre-defined rule connecting system load and data acquisition frequency. It's a lookup table or linear function pre-stored in the management controller, defining the data acquisition frequency for different load ranges. For example, when CPU utilization exceeds 85%, the data acquisition frequency is halved; when it's below 30%, the frequency is doubled. The new data acquisition frequency value is an adjusted value calculated based on the load status report data and the mapping relationship.

[0039] In this embodiment, the performance monitoring unit inside the central processing unit 103 collects internal data of the central processing unit, such as CPU utilization, memory usage, and core frequency, periodically generates load status report data, and sends it to the management controller 102. The performance monitoring unit is used to monitor the instruction execution rate, cache hit rate, and thread activity status of the central processing unit in real time. After receiving the load status report data, the management controller 102 parses the load indicators and compares them with a preset load-frequency mapping relationship. Based on the comparison result, it determines the corresponding new data acquisition frequency value. For example, when the CPU utilization exceeds 85%, the acquisition frequency is adjusted to 50% of the original; when the CPU utilization is below 30%, the data acquisition frequency is adjusted to 200% of the original. The management controller 102 updates the data acquisition frequency in the initial data acquisition parameters, and the first process performs subsequent acquisition operations according to the new frequency.
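The example mapping above (halve the frequency above 85% CPU utilization, double it below 30%) can be written as a minimal sketch. The function names `frequency_factor` and `update_frequency` are hypothetical, and leaving the frequency unchanged between the two thresholds is an assumption, since the text does not state the behavior for that range.

```python
def frequency_factor(cpu_util_pct, high_pct=85.0, low_pct=30.0):
    """Map a CPU utilization percentage to a frequency multiplier."""
    if not 0.0 <= cpu_util_pct <= 100.0:
        raise ValueError("utilization must be within 0..100")
    if cpu_util_pct > high_pct:
        return 0.5  # high load: acquire half as often
    if cpu_util_pct < low_pct:
        return 2.0  # low load: acquire twice as often
    return 1.0      # assumed: no change between the two thresholds


def update_frequency(current_hz, cpu_util_pct):
    """Return the new data acquisition frequency for one load report."""
    return current_hz * frequency_factor(cpu_util_pct)


for util in (20.0, 50.0, 90.0):
    print(util, update_frequency(1.0, util))
```

A lookup table with more load ranges, or a linear function, would fit the same interface.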

[0040] As can be seen from the above, this embodiment achieves dynamic adaptation between the acquisition frequency and the system load, avoids the acquisition operation and business operation competing for resources, ensures the smooth operation of the system under high load, and increases the acquisition frequency under low load to ensure the comprehensiveness of fault monitoring, taking into account both system performance and monitoring effect.

[0041] Based on the above embodiments, this embodiment also provides a method for consistently scaling the collection frequency of all data sources when the system load changes, which may include the following: The management controller 102 is also configured to: determine the global frequency adjustment coefficient based on the resource status signal and the load-frequency mapping relationship; obtain the initial data acquisition frequency of each data source from the initial data acquisition parameters; determine the new data acquisition frequency of each data source according to the global frequency adjustment coefficient, and update the initial data acquisition parameters accordingly.

[0042] The global frequency adjustment coefficient is a globally applicable frequency adjustment ratio calculated based on resource status signals. It is used to uniformly adjust the acquisition frequency of all data sources. This coefficient is a temporary global adjustment factor and does not override the independent baseline frequency set by other strategies (such as error rate-based adjustments) for each data source. Instead, it is superimposed on the baseline frequency. When the system returns to normal load conditions, the global frequency adjustment coefficient can be restored to 1. The initial data acquisition frequency of each data source refers to the data acquisition frequency currently used by the first process.

[0043] In this embodiment, the management controller 102 calculates a global frequency adjustment coefficient based on the received resource status signal (such as the overall average CPU utilization) and a preset load-frequency mapping relationship. For example, the coefficient is 0.5 under high load (halving all acquisition frequencies), 1 under normal load, and 1.5 under low load. The initial data acquisition frequency of each data source is obtained from the current initial data acquisition parameters and multiplied by the global coefficient to obtain the new data acquisition frequency, which is then written back into the acquisition parameters.
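A minimal sketch of the global coefficient follows, using the example values 0.5, 1, and 1.5. The utilization thresholds of 85% and 30% are borrowed from the earlier load-frequency example and are an assumption here, as are the function and data source names. The baseline frequencies are kept untouched so that the coefficient can later be restored to 1.

```python
def global_coefficient(avg_cpu_util_pct, high_pct=85.0, low_pct=30.0):
    """Return the global frequency adjustment coefficient for a load level."""
    if avg_cpu_util_pct > high_pct:
        return 0.5  # high load: global throttling
    if avg_cpu_util_pct < low_pct:
        return 1.5  # low load: global acceleration
    return 1.0      # normal load


def effective_frequencies(baseline_hz, coefficient):
    """Scale every source's baseline frequency without modifying the baseline."""
    return {source: hz * coefficient for source, hz in baseline_hz.items()}


# Hypothetical data sources and per-source baseline frequencies (Hz).
baseline = {"ce_counter": 0.2, "cpu_temp": 1.0}

throttled = effective_frequencies(baseline, global_coefficient(90.0))
restored = effective_frequencies(baseline, global_coefficient(50.0))
print(throttled, restored)
```

Because the coefficient is applied to a copy, restoring normal operation only requires recomputing with a coefficient of 1.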

[0044] In this embodiment, by introducing a global adjustment coefficient, it is possible to respond quickly and consistently to system-level load fluctuations with extremely low computational overhead, achieve global throttling or recovery of monitoring intensity, and quickly adapt to changes in the overall system load. While ensuring the effectiveness of fault monitoring, it minimizes the impact of data acquisition operations on system performance and improves system resource utilization.

[0045] To reduce the difficulty of system configuration and maintenance, improve the comprehensiveness and accuracy of memory fault diagnosis, and enable flexible customization of data collection behavior according to different hardware configurations and operation and maintenance strategies, based on the above embodiments, this embodiment also provides an exemplary method for configuring initial data collection parameters, which may include the following: The management controller 102 is connected to the non-volatile memory via a hardware interaction bus; the target storage location of the non-volatile memory stores the data source acquisition strategy table, which includes at least the data source identifier and acquisition configuration parameters; the management controller 102 is also configured to: during the initialization process, read the data source acquisition strategy table from the non-volatile memory, and configure the corresponding acquisition parameters for each data source according to the data source acquisition strategy table, as the initial data acquisition parameters.

[0046] The hardware interaction bus refers to the physical connection line between the management controller 102 and the non-volatile memory on the motherboard. For example, if the non-volatile memory is SPI (Serial Peripheral Interface) flash memory, the hardware interaction bus is the SPI bus. The target storage location is a non-volatile storage area storing the data source acquisition strategy table, such as a hard disk or flash memory. The data source acquisition strategy table records the configuration information of each data source and is updated in real time through out-of-band communication (such as APML) to ensure that the acquisition frequency adapts to the system state. It includes at least a data source identifier and acquisition configuration parameters. The data source identifier is a unique identifier used to distinguish different data sources, such as a memory bandwidth counter ID or a CPU temperature sensor ID. Acquisition configuration parameters refer to the configurable acquisition parameters of different data sources, such as acquisition mode, initial acquisition period, event level, and data acquisition frequency. Acquisition modes include event-triggered, polling, and mixed modes. The initial acquisition period is the default acquisition interval of the data source. The event level is one of immediate response, delayed response, or none. The data acquisition frequency is a frequency value that can be dynamically adjusted during runtime.

[0047] In this embodiment, during the system deployment phase, the user can configure a standardized data source acquisition strategy table based on memory hardware characteristics and operational requirements. The table clearly defines the data source identifier, acquisition mode, initial acquisition cycle, event level, data acquisition frequency, and other acquisition parameters for each data source. The defined data source acquisition strategy table is stored in the target storage location to ensure rapid readability upon system startup. During the first process initialization, the management controller reads the data source acquisition strategy table from the target storage location and parses the configuration information. Based on the contents of the strategy table, corresponding acquisition parameters are configured for each data source. These parameters are used as the initial data acquisition parameters for the first data acquisition task. During the data acquisition process of the first process, a low-power loop is initiated, adaptively adjusting the acquisition parameters for the next moment or the next round of data acquisition tasks based on monitored signals.
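One possible shape of the data source acquisition strategy table and its loading step is sketched below. The JSON encoding, the field names, and the example entries are assumptions made for illustration; the text specifies only which kinds of fields the table contains, not its storage format.

```python
import json
from dataclasses import dataclass

VALID_MODES = {"event", "polling", "mixed"}
VALID_LEVELS = {"immediate", "delayed", "none"}


@dataclass
class SourcePolicy:
    source_id: str           # unique data source identifier
    mode: str                # acquisition mode
    initial_period_s: float  # default acquisition interval in seconds
    event_level: str         # response level for events from this source
    frequency_hz: float      # runtime-adjustable acquisition frequency


def load_policy_table(text):
    """Parse and validate the strategy table; return source_id -> SourcePolicy."""
    table = {}
    for row in json.loads(text):
        policy = SourcePolicy(**row)
        if policy.mode not in VALID_MODES:
            raise ValueError(f"bad mode for {policy.source_id}: {policy.mode}")
        if policy.event_level not in VALID_LEVELS:
            raise ValueError(f"bad event level for {policy.source_id}")
        table[policy.source_id] = policy
    return table


# Hypothetical table contents, as they might be stored in non-volatile memory.
EXAMPLE_TABLE = """
[
  {"source_id": "mem_bw_counter", "mode": "polling", "initial_period_s": 5.0,
   "event_level": "none", "frequency_hz": 0.2},
  {"source_id": "dimm0_ce", "mode": "mixed", "initial_period_s": 10.0,
   "event_level": "immediate", "frequency_hz": 0.1}
]
"""

initial_params = load_policy_table(EXAMPLE_TABLE)
print(sorted(initial_params))
```

Keeping the table outside the program is what allows acquisition behavior to be changed without modifying the acquisition code.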

[0048] Furthermore, event levels can be adjusted based on error severity, for example divided into high, medium, and low levels. System events are continuously monitored. When a high-level event is detected (such as an uncorrectable error interrupt or hardware fault signal), data collection is immediately triggered, for example by directly reading the memory controller register to obtain the error address and type. For medium-level events (such as a sudden increase in correctable error count), the collection frequency is increased (e.g., the collection period is shortened from 5 seconds to 1 second), and the trend is recorded through the BMC log. For low-level events or no events, if the error count is stable or the system load is low, the collection frequency is gradually decreased (e.g., the collection period is extended to 10 seconds), with the decrease not exceeding 50% to avoid missed detections.
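The level-based adjustment above can be sketched as follows. The sketch interprets "the decrease not exceeding 50%" as meaning that the collection frequency never drops below half of its baseline, i.e. the period never grows beyond twice the baseline period; this interpretation, the step factor of 1.25, and the function name `next_period` are assumptions.

```python
def next_period(current_s, baseline_s, level, fast_s=1.0, step=1.25):
    """Return (new_period_s, acquire_now) for one observed event level."""
    if level == "high":
        # Uncorrectable error or hardware fault: acquire immediately.
        return current_s, True
    if level == "medium":
        # Sudden rise in correctable errors: shorten the period.
        return min(current_s, fast_s), False
    if level == "low":
        # Stable system: lengthen the period gradually, capped at 2x baseline
        # (assumed reading of "decrease not exceeding 50%").
        return min(current_s * step, baseline_s * 2.0), False
    raise ValueError(f"unknown event level: {level}")


print(next_period(5.0, 5.0, "medium"))
print(next_period(5.0, 5.0, "low"))
print(next_period(9.0, 5.0, "low"))
```

With a 5-second baseline, repeated low-level observations stretch the period step by step until it reaches the 10-second cap.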

[0049] As can be seen from the above, this embodiment separates the collection logic from the configuration data by using an external data source collection strategy table. Operation and maintenance personnel can adjust the monitoring scope, frequency, and response method by modifying the data source collection strategy table without modifying the program code. This makes the system highly configurable and adaptable, easy to deploy on different types of servers or in different application scenarios, reduces the difficulty of system configuration and maintenance, and improves the scalability of the system.

[0050] For example, this embodiment also provides an adaptive data acquisition process, as shown in Figure 2, which may include the following: A1: When the system powers on or the module starts, the preset data source acquisition strategy table is loaded from the non-volatile memory.

[0051] A2: Enter the main listening loop and collect data periodically according to the pre-set timer cycle.

[0052] The main listening loop is a low-power infinite loop that simultaneously listens for three types of inputs: hardware event interrupts (high priority); polling cycle signals generated by an internal timer; and load reports from the system performance monitoring unit.

[0053] A3: Response steps for high-priority paths.

[0054] This step handles event-triggered scenarios. When the hardware signals a trigger condition for an immediate data collection interrupt, such as the number of memory errors exceeding the threshold set according to the fault handling accuracy requirements of the currently running business, the corresponding response action is to suspend the current non-critical task and immediately schedule a targeted collection of data related to the event.

[0055] A4: Response steps for regular priority paths.

[0056] This step handles periodic polling scenarios. When the trigger condition is met, namely a modified polling timer value is received for a data source, the corresponding response action is to determine the collection period according to the newly modified value.

[0057] A5: Response steps for adjusting the path in the background.

[0058] This step handles system load awareness scenarios. When a periodic system load report is received as a trigger, such as CPU utilization > 85%, the corresponding response is as follows: Based on the preset load-frequency mapping relationship, a global frequency adjustment coefficient is calculated. This coefficient does not directly overwrite the frequency of any single data source; rather, it serves as a global reference that is superimposed on the data acquisition frequency of each data source to achieve global throttling or acceleration. For example, when the monitored resource status, such as temperature, returns to normal, the global frequency adjustment coefficient reverts to 1.

[0059] A6: Updates and Scheduling.

[0060] After each data collection and response operation is completed, the relevant fields in the data source collection strategy table of A1 are updated, such as the current frequency, the last collection timestamp, and the data status. Then, the main listening loop of A2 continues to execute, waiting for the next event or periodic signal, so as to achieve continuous and dynamic data collection and management.
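Steps A2 to A6 can be sketched as a priority-ordered dispatch over pending inputs. The sketch drains a finite batch rather than running an infinite low-power loop, and the names `run_loop` and `PRIORITY`, the input tuples, and the table layout are hypothetical.

```python
import heapq
import itertools

# Lower number means higher priority (A3 before A4 before A5).
PRIORITY = {"hw_event": 0, "timer": 1, "load_report": 2}


def run_loop(inputs, table):
    """Handle pending inputs in priority order; return (action log, coefficient)."""
    order = itertools.count()  # tie-breaker keeps arrival order within a priority
    heap = [(PRIORITY[kind], next(order), kind, data) for kind, data in inputs]
    heapq.heapify(heap)
    coefficient = 1.0
    log = []
    while heap:
        _, _, kind, data = heapq.heappop(heap)
        if kind == "hw_event":    # A3: high-priority path
            log.append(f"collect-now:{data['target']}")
        elif kind == "timer":     # A4: regular-priority path
            table[data["source"]]["period_s"] = data["new_period_s"]
            log.append(f"period:{data['source']}")
        else:                     # A5: background adjustment path
            coefficient = 0.5 if data["cpu_util_pct"] > 85.0 else 1.0
            log.append(f"coefficient:{coefficient}")
        # A6: a real loop would also update timestamps and status in the table.
    return log, coefficient


table = {"dimm0_ce": {"period_s": 10.0}}
pending = [
    ("load_report", {"cpu_util_pct": 90.0}),
    ("timer", {"source": "dimm0_ce", "new_period_s": 15.0}),
    ("hw_event", {"target": "dimm0"}),
]
log, coefficient = run_loop(pending, table)
print(log, coefficient)
```

Although the hardware event arrives last in this batch, it is handled first, which matches the high-priority path of step A3.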

[0061] Based on the above embodiments, this embodiment also provides a method for ensuring uninterrupted memory monitoring when upgrading, patching, or debugging the first or second process during the system operation of a computer device. This method may include the following: The management controller 102 is also configured to perform corresponding operations on the first process when it receives an operation request to restart or update the first process, while the second process continues to execute memory fault prediction.

[0062] The restart operation involves shutting down the first process and restarting it to apply new configurations or fix process anomalies. The update operation upgrades and optimizes the program code and configuration parameters of the first process. When the first process needs to be restarted or updated, such as adding support for new types of error registers, the system receives relevant operation instructions and initiates the corresponding processing flow. If a restart operation is performed, the system shuts down the currently running first process, while ensuring that the second process continues to perform memory fault prediction based on collected historical data and cached data and / or waiting for new data, unaffected by the restart. After the first process restarts, it automatically loads the latest collection parameters, resumes collection operations, and continuously sends newly collected memory error information to the second process through the data pipeline. If an update operation is performed, such as using a canary release method, a portion of the first process is updated first to verify that the updated collection function is normal and does not affect the second process, and then all collection processes are updated gradually, ensuring zero interruption to the second process throughout the process.

[0063] As can be seen from the above, this embodiment is based on two independent processes. When the first process is restarted and updated independently, the second process works without interruption, which ensures the continuity of memory fault monitoring and prediction functions, reduces the risk of failure during system maintenance and upgrade, improves the maintainability of the system and the availability of online services, and meets the high availability requirements of data centers.

[0064] Based on the above embodiments, this embodiment also provides a method for preventing the loss of memory error monitoring data and achieving rapid recovery when the second process crashes unexpectedly due to software defects, resource exhaustion, or other reasons. This method may include the following: The management controller 102 also includes a storage area to be synchronized and updated. The first process and the second process communicate with each other through a data pipeline. The management controller 102 is also configured such that: when the second process terminates abnormally, the first process continues to collect memory error information and caches it in the storage area to be synchronized and updated; when the second process is detected to have recovered, the first process synchronizes the memory error information in the storage area to be synchronized and updated to the second process through the data pipeline.

[0065] Abnormal termination refers to the second process stopping due to program errors, resource exhaustion, or other reasons. The storage area to be synchronized and updated is a dedicated storage area for caching memory error information; it temporarily stores collected data while the second process is unavailable. In this embodiment, the system can monitor the running status of the second process in real time, for example, by using heartbeat detection to determine whether it is working normally. When abnormal termination of the second process is detected, the first process is unaffected and continues to collect memory error information from the memory controller according to the current collection parameters. To avoid data loss, the first process temporarily caches newly collected memory error information in the preset storage area to be synchronized and updated, and can also record the timestamp and related attributes of the cached data. The system initiates the recovery of the second process, for example by automatically restarting the second process or notifying maintenance personnel for manual recovery. When the second process is detected to have resumed normal operation, for example after being restarted by a watchdog process, the first process actively reads the cached memory error information from the storage area to be synchronized and updated, synchronizes it to the newly started second process through the data pipe, and simultaneously resumes real-time data transmission, enabling the second process to continue analysis and prediction based on the complete data sequence.
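As a non-limiting illustration of the cache-and-resynchronize behavior described above, the following Python sketch buffers records while the second process is down and replays them in order once it recovers. The class name `FirstProcessBuffer`, the `consumer_alive` flag (standing in for the result of heartbeat detection), the bounded buffer size, and the record fields are all assumptions made for illustration and do not come from the embodiment.

```python
import time
from collections import deque


class FirstProcessBuffer:
    """Hypothetical stand-in for the storage area to be synchronized and updated."""

    def __init__(self, max_records=10_000):
        # Bounded so that a long outage cannot exhaust controller memory (assumption).
        self.pending = deque(maxlen=max_records)

    def handle(self, record, consumer_alive, send):
        """Forward a record, or cache it with a timestamp while the consumer is down."""
        if consumer_alive:
            self.flush(send)  # replay the backlog first so ordering is preserved
            send(record)
        else:
            self.pending.append((time.time(), record))

    def flush(self, send):
        """Send every cached record, oldest first, and empty the cache."""
        while self.pending:
            _timestamp, record = self.pending.popleft()
            send(record)


# Example: the second process is down for two samples, then recovers.
received = []
buf = FirstProcessBuffer()
buf.handle({"dimm": 0, "ce": 1}, consumer_alive=True, send=received.append)
buf.handle({"dimm": 0, "ce": 2}, consumer_alive=False, send=received.append)
buf.handle({"dimm": 0, "ce": 3}, consumer_alive=False, send=received.append)
buf.handle({"dimm": 0, "ce": 4}, consumer_alive=True, send=received.append)
```

After the last call, `received` holds all four records in collection order, which is what allows the recovered second process to analyze a complete data sequence.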

[0066] As can be seen from the above, the first process in this embodiment can still run independently and cache data when the second process is abnormal, ensuring the continuous collection and storage of memory error information and avoiding data loss; after the process recovers, the cached data can be quickly synchronized to ensure the continuity and integrity of fault prediction, improve the fault tolerance and reliability of the system, and enhance the robustness of the entire memory fault handling system.

[0067] The above embodiments provide an implementation process for triggering fault repair operations in different scenarios when a memory fault is predicted, which may include the following: For example, the fault repair operation is a page isolation operation, which marks memory pages with fault risks as offline to prevent the system from continuing to use the page and causing the fault to escalate. In this embodiment, the page isolation operation is performed by the BIOS and OS running on the central processing unit 103. The management controller 102 is also configured to: when the fault repair operation triggered by the second process is a faulty page isolation operation, trigger the generation of a system control interrupt signal and send the system control interrupt signal to the central processing unit 103; the central processing unit 103 runs the operating system and the basic input / output system, which is configured to: respond to the system control interrupt signal through the system management interrupt handler, obtain the page information to be isolated through the out-of-band management bus, generate error record data containing the translated system physical address and error severity information, and push the error record data to the target hardware error source; the operating system performs a memory page offline operation on the page to be isolated based on the error record data.

[0068] The system control interrupt signal is the interrupt signal, initiated by the management controller, that triggers the BIOS to perform fault handling. The system management interrupt handler is a program module in the BIOS that handles system management interrupts and can interact with the management controller to obtain fault information. The information on the page to be isolated describes the memory page that needs to be taken offline, including its physical address, error severity, and so on. The error record data, such as a CPER, records memory error-related information, including the translated system physical address and error severity information. The target hardware error source, such as GHES, receives and transmits the error record data. Page offlining removes memory pages at risk of failure from the system's available memory to prevent their continued use.

[0069] For example, in this embodiment, the BMC triggers an SCI (System Control Interrupt). This SCI signal is received by the CPU, and the system management interrupt handler in the CPU's basic input / output firmware responds to this interrupt, entering system management mode. In system management mode, the BIOS program interacts with the BMC, for example through a specific shared memory region or OOB command, to obtain information about the page to be isolated. This information can be the system physical address predicted by the BMC. The BIOS generates an ACPI-compliant error record, namely a memory CPER (Common Platform Error Record), which contains the system physical address translated by the BIOS and the severity information of the error. The BIOS pushes the CPER record to the target hardware error source, namely the GHES driver in the operating system kernel. The OS recognizes the memory CPER and, without requiring an operating system patch, performs a page offlining process.

[0070] As can be seen from the above, this embodiment achieves secure and seamless integration between out-of-band prediction results and the operating system's memory management. The entire process is transparent to the operating system, has strong compatibility, and does not require the development of dedicated kernel drivers for different OS versions, thereby improving system stability and business continuity.

[0071] For example, whether the operation is a fault repair operation or a page isolation operation, this embodiment targets a computer device with an independent processor that performs error handling, diagnosis, and recovery operations via an out-of-band path. This independent processor is defined as an error processor, which is connected to the central processing unit 103 of the computer device. For instance, a computer device using the AMD Venice platform has an MPRAS (Microprocessor for Reliability, Availability, Serviceability), and in this scenario the MPRAS serves as the error processor. The MPRAS is a dedicated RISC-V (Reduced Instruction Set Computer-V) microcontroller inside an AMD EPYC server processor, responsible for the RAS tasks originally handled by the x86 system management mode. As shown in Figure 3, in non-system management mode, the system management unit and MPRAS communicate and coordinate through an internal interface, using MPRAS to perform error handling tasks more efficiently and with less interference. After the introduction of MPRAS, page isolation is also performed by the MPRAS microprocessor: the error processor performs memory page offline operations on the pages to be isolated based on the fault location information sent by the system management unit, which can reduce the occurrence of SMIs and improve system efficiency and reliability.
In this computer device, the management controller 102 is further configured to: when the fault repair operation triggered by the second process is a fault page isolation operation, determine the memory fault location information, and send the memory fault location information to the system management unit of the central processing unit 103 through an out-of-band interface; the memory fault location information includes memory row and column address information; the error processor is configured to: receive the memory fault location information forwarded by the system management unit, translate the memory row and column address information into memory page address information, generate error record data containing memory page address information and error severity information, write the error record data to a reserved error block buffer, and trigger the generation of a system interrupt signal; the operating system running on the central processing unit 103 receives the system interrupt signal, calls the target hardware error source driver to read and parse the error record data from the error block buffer, and the operating system performs a memory page offline operation on the page to be isolated based on the error record data.

[0072] Among these, memory fault location information refers to the location data of the memory fault, including memory row and column address information, which are the row and column coordinates identifying the memory fault. The out-of-band interface (OOB) is a communication interface independent of the main system bus, used for out-of-band management and data transmission. The system management unit (SMU) is a security and coordination management subsystem integrated within the central processing unit, responsible for tasks such as power consumption regulation and temperature control, and can forward fault information. Memory page address information identifies the address data of memory pages in the system, used by the operating system to locate and isolate faulty pages.

[0073] Taking the management controller (BMC) and the error handler (MPRAS) as an example, the BMC's second process predicts persistent faults and can instruct page isolation operations. The BMC predicts and determines the memory page addresses that need to be taken offline. The BMC sends fault location information, including memory row and column addresses, to the SMU FW (firmware) via the OOB interface, which then forwards it to the MPRAS FW. The MPRAS FW translates the received memory row and column address information into operating system-recognizable memory page address information. The memory page address information is the system physical address information. It generates error log data containing this memory page address information and the fault severity, such as a memory CPER. Finally, the memory CPER containing the system physical address and severity information is pushed to GHES. The OS recognizes the memory CPER and, without requiring an operating system patch, executes the page offlining process.

[0074] As can be seen from the above, this embodiment bypasses SMM (System Management Mode), eliminating the performance stalls and potential security risks caused by SMI. By performing fault page isolation through an out-of-band path and a dedicated microprocessor, it achieves lower latency and higher security memory fault isolation. It does not require the main system's CPU resources, avoiding the impact of system management interruptions on system performance. The out-of-band path is independent of the main system, so even if the main system is in an unstable state, fault page isolation can still be completed normally, improving the reliability and timeliness of the isolation operation, further ensuring stable system operation, and is suitable for data center servers with extremely high performance and security requirements.

[0075] For example, the fault repair operation is a Runtime PPR (runtime Post Package Repair) operation. Runtime PPR achieves transparent self-repair of memory faults at the hardware level. The central processing unit 103 of the computer device may also include a coprocessor to execute the Runtime PPR. For instance, for a computer device using an AMD architecture, the coprocessor could be the AMD Secure Processor (ASP). The management controller 102 is further configured to: when the fault repair operation triggered by the second process is a Runtime PPR operation, generate a post package repair request containing the faulty physical address, and send the post package repair request to the system management unit of the central processing unit 103 via the out-of-band management bus; the system management unit forwards the post package repair request to the coprocessor, and the coprocessor executes the request in a trusted execution environment.

[0076] For example, in a scenario where the central processing unit 103 adopts a first-type processor architecture, the first-type processor architecture refers to a processor with an autonomous debugging data collection function such as ADDC, such as an architecture represented by AMD processors. Correspondingly, the management controller 102 is also configured to: configure the monitoring parameters of the autonomous debugging data collection hardware module built into the central processing unit 103 through the out-of-band management bus. When the memory controller of the central processing unit 103 detects memory error information, the autonomous debugging data collection hardware module automatically captures the memory error information, and the first process obtains the memory error information from the autonomous debugging data collection hardware module through the out-of-band management bus according to the initial data acquisition parameters.

[0077] In this embodiment, the first process of the management controller 102 is a first process based on the autonomous debugging data collection function. This out-of-band first process collects memory error information. The so-called out-of-band first process refers to the collection process implemented based on the ADDC function, which can bypass the central processing unit 103, operating system, and system firmware to perform collection operations. The first process of the management controller 102 sends the memory error information to the second process. The second process predicts memory faults and triggers fault repair operations when a memory fault is predicted. Thus, based on the existing ADDC out-of-band autonomous acquisition framework on the AMD platform, the technical solution of this invention achieves functionality similar to the MRT in the Intel architecture.

[0078] For this type of memory fault handling system, as shown in Figure 4, the central processing unit 103 includes an ASP (AMD Secure Processor), an SMU, a UMC (Unified Memory Controller), and various functional modules for performing page isolation operations. When data is read from memory 101, the memory controller of the central processing unit 103 verifies the data. When the verification fails, the relevant registers of the memory controller are updated and error information is recorded, such as the specific location and physical address of the memory error. The management controller 102 monitors the relevant registers of the memory controller in real time, out-of-band, through APML. The management controller 102 records historical fault address information and can also display the corresponding fault information through a web page.

[0079] In practical applications, a memory fault handling system may include a fault adaptive acquisition unit, a dual-process processing unit, and a fault handling unit. The fault adaptive acquisition unit is composed of memory 101, management controller 102, central processing unit 103, and the ADDC process, each executing computer program instruction segments related to data acquisition. For example, the computer program instruction segments that execute data acquisition-related functions in the ADDC process can constitute an adaptive data acquisition module. ADDC is developed by AMD and does not require developers to implement it; developers only need to perform minor configuration and add pipes to the ADDC process to pass messages, which significantly shortens the development cycle. The central processing unit 103, through its multiple internal microprocessors, such as the ASP and MP1 (Power Management Microprocessor), works independently and silently in the background, undertaking a series of critical tasks ranging from power-on, power consumption regulation, performance improvement, and temperature control to system security. The adaptive data acquisition module maintains a data source acquisition strategy table and receives event interrupts from hardware, polling cycle signals generated by internal timers, and system load information. As shown in Figure 2, data acquisition parameters such as acquisition period, event level, and data acquisition frequency are configured or adjusted for each data source being collected. The dual-process processing unit isolates the first and second processes and executes the tasks related to both. This allows the first process to be restarted and canary-released independently while the second process remains uninterrupted. For example, if the second process crashes due to an unknown error, the underlying ADDC process can still run independently, ensuring uninterrupted monitoring and providing a foundation for system recovery.
If the first process requires frequent script modifications, additional measurement points, or protocol changes, the execution of the fault prediction task is not interrupted. Process isolation achieves excellent fault tolerance and maintainability while significantly reducing development time. The fault handling unit is triggered by the prediction processing process and can perform fault page isolation and Runtime PPR operations. When the prediction processing process predicts a fault, it can instruct the fault handling unit to perform fault repair operations, such as page isolation or Runtime PPR. On the AMD platform, Runtime PPR relies on the ASP, leveraging the ASP's independence, security, and reliability to achieve transparent, hardware-level self-healing.

[0080] As can be seen from the above, this embodiment utilizes the ADDC function built into the AMD processor to construct the acquisition process, reducing development costs and realizing predictive fault handling functions similar to Intel MRT; the dynamic acquisition strategy avoids performance loss and improves the memory reliability and availability of AMD architecture devices.

[0081] For example, in a computer device employing the first-type processor architecture, if the device also has an error processor that performs error handling, diagnosis, and recovery operations via an out-of-band path, such as AMD's Venice platform, then MPRAS is the error processor that takes over the system's RAS tasks. With the introduction of MPRAS, page isolation is correspondingly handled by MPRAS. The error processor, based on the fault location information sent by the system management unit, performs memory page offline operations on the pages to be isolated, which can reduce the occurrence of SMIs (System Management Interrupts). As shown in Figure 5, in this application scenario, the fault adaptive acquisition unit is composed of memory 101, management controller 102, central processing unit 103, the microprocessor, and the ADDC process, each executing computer program instruction segments related to data acquisition. When the prediction processing process triggers a page isolation operation, the central processing unit offloads the task to the microprocessor.

[0082] As can be seen from the above, this embodiment performs fault page isolation through an independent error handler, avoiding the impact of system management interruptions on the performance of the main system and improving the efficiency of isolation operations; the out-of-band path is independent of the main system, so even if the main system is unstable, isolation can still be completed normally, enhancing the reliability and timeliness of fault handling.

[0083] Corresponding to the aforementioned memory fault handling system, this invention also provides a memory fault handling method; please refer to Figure 6. The memory fault handling method provided by the present invention can be implemented as a computer program product that is installed and run in the out-of-band processor (such as the baseboard management controller) of each node device using an AMD architecture in a data center server cluster, to implement predictive processing and adaptive repair of server memory faults. In some embodiments of the method, the out-of-band management hardware device, or out-of-band processor, is a management controller, such as a BMC. The process by which the management controller performs low-interference, high-reliability memory fault prediction and fault repair may include the following steps: S601: Construct a first process and a second process that are isolated from each other.

[0084] S602: During the process of collecting memory error information using the first process according to the initial data acquisition parameters, listen to the acquisition parameter adjustment trigger signal, and adjust the initial data acquisition parameters accordingly based on the signal type of the acquisition parameter adjustment trigger signal.

[0085] S603: The first process sends memory error information to the second process, which then predicts memory failures. When the second process predicts a memory failure, it triggers a fault repair operation.
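Steps S601 to S603 can be sketched, in a simplified and non-limiting form, as a collector and a predictor connected by a pipe. In the sketch below, `queue.Queue` merely stands in for the data pipe, and both roles run in one interpreter for brevity, whereas the embodiment uses two isolated processes. The threshold `CE_THRESHOLD`, the record fields `page` and `ce_count`, and the simple counting rule are assumptions for illustration; the embodiment does not specify the prediction model, and the parameter adjustment of S602 is omitted here.

```python
from queue import Queue

CE_THRESHOLD = 3  # hypothetical: predict a fault after 3 correctable errors on one page


def first_process(samples, pipe):
    """Collector role: push each memory error record into the pipe."""
    for record in samples:
        pipe.put(record)
    pipe.put(None)  # sentinel: no more data


def second_process(pipe, repair):
    """Predictor role: accumulate errors per page and trigger repair at the threshold."""
    counts = {}
    while (record := pipe.get()) is not None:
        page = record["page"]
        counts[page] = counts.get(page, 0) + record["ce_count"]
        if counts[page] >= CE_THRESHOLD:
            repair(page)      # e.g. page isolation or Runtime PPR
            counts[page] = 0  # start counting afresh after a repair request


# Example run with three hypothetical error records.
pipe = Queue()
repaired_pages = []
first_process(
    [{"page": 0x1000, "ce_count": 1},
     {"page": 0x2000, "ce_count": 1},
     {"page": 0x1000, "ce_count": 2}],
    pipe,
)
second_process(pipe, repaired_pages.append)
```

Only page `0x1000` reaches the threshold, so it is the only page for which a repair operation is requested.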

[0086] The process by which the management controller performs low-interference and highly reliable memory fault prediction and fault repair can be found in the implementation process described in the above embodiments, and will not be repeated here.

[0087] For Intel-based computer devices, in practical applications, Memory Resilience Technology (MRT) is integrated into these devices via BIOS or UEFI (Unified Extensible Firmware Interface) firmware, combined with an open-source BMC stack. This enables out-of-band management, allowing continuous operation even when the host operating system is unresponsive, effectively ensuring the memory reliability of the computer device. The MRT utilizes machine learning models to analyze memory error patterns, enabling predictive intervention for potential failures. It shifts the focus from reactive remediation to proactive and even preventative handling, and this proactive, intelligent memory fault management significantly improves the reliability and serviceability of Intel-based servers.

[0088] Intel-based computer devices, when using the MRT for memory management, utilize the RAS (Reliability, Availability, Serviceability) offload function during out-of-band error data acquisition to offload certain error handling to the BMC, minimizing SMI (System Management Interrupt) usage. SMI is a global, or broadcast, event. Once an SMI is raised, all CPU cores are suspended, and all CPU threads enter SMM immediately after completing their current instruction. Once in SMM, threads are unusable by the operating system until the SMM handler releases them back to the operating system. Because the SMM code is invisible to the OS, the OS cannot verify the SMM handler. Within SMM, resources locked by the operating system can be manipulated, which is unsafe. Furthermore, because SMI needs to address both asynchronous and synchronous issues, the SMM handler needs to handle different types of interrupts, such as MCA (Machine Check Architecture), CSMI (Chipset System Management Interrupt), and MSMI (Memory System Management Interrupt), which increases the complexity of SMI handling. In severe cases, improper handling can cause a kernel panic. RAS offload also uses the BMC as a carrier to handle correctable errors. Based on this, memory inspection, fault collection, diagnostic logging, and predictive maintenance can be completed with zero or very low usage of main computing resources. However, frequent data acquisition can affect out-of-band processor performance. First, exclusive PECI time windows block bus multiplexing: the current MRT reads the MCA or MCU (Microcontroller Unit) registers inside the CPU through the PECI command RdPkgConfig() at a fixed period of 1 s or 2 s to obtain CE (Correctable Error) count data. The PECI bus is narrow-band and half-duplex, so polling the CE register at a fixed period creates exclusive time slots that block critical telemetry paths such as temperature, power consumption, and fan speed, and also introduces control jitter and timeout reset risks. Second, during PECI communication the CPU must keep certain circuits powered on, such as the PECI clock, which prevents the package from entering the C1E power-saving state. The measured out-of-band power of a single node increases by 0.6 W, and the annual power consumption of a cluster of 100,000 units increases by 525 MWh, equivalent to 420 tCO2, which does not meet the relevant energy consumption requirements of data centers.
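The energy figures above can be checked with a short calculation. The following Python sketch assumes continuous operation for a 365-day year; the emission factor it derives is only the value implied by the stated 525 MWh and 420 tCO2 and is not given in the text.

```python
EXTRA_POWER_W = 0.6        # measured extra out-of-band power per node (from the text)
NODES = 100_000            # cluster size (from the text)
HOURS_PER_YEAR = 24 * 365  # assumption: nodes run all year

# Extra annual energy in MWh: W * nodes * hours = Wh, divided by 1e6 gives MWh.
extra_mwh = EXTRA_POWER_W * NODES * HOURS_PER_YEAR / 1e6

# Emission factor implied by the stated figures (420 t CO2 for 525 MWh).
implied_kg_co2_per_kwh = (420 * 1000) / (525 * 1000)

print(f"{extra_mwh:.1f} MWh per year")
print(f"{implied_kg_co2_per_kwh:.2f} kg CO2 per kWh (implied)")
```

The result is about 525.6 MWh per year, consistent with the stated 525 MWh, and the stated 420 tCO2 corresponds to an implied factor of 0.8 kg CO2 per kWh.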

[0089] In view of this, in order to solve the problems existing in the memory fault handling process of this type of computer equipment, this invention can build a first process using the existing RAS offload module out-of-band autonomous acquisition framework on the Intel platform and trigger the fault repair operation. The computer program of the above memory fault handling method is installed and run in the management controller of this type of computer equipment, enabling the management controller to adopt a dual-process communication method, optimize the MRT architecture, and use a dynamic and adaptive sampling method to completely eliminate high-frequency interference to the CPU and memory subsystems, and significantly reduce the storage and network bandwidth resources on the BMC side. This achieves both reduced development costs and reduced impact of frequent data acquisition on out-of-band processor performance.

[0090] It should be noted that there is no strict order of execution between the steps in this invention. As long as the logical order is respected, these steps can be executed simultaneously or in a certain preset order. Figure 2 and Figure 6 are only illustrative examples and do not mean that the order shown is the only possible execution order.

[0091] This invention also provides a corresponding apparatus for memory fault handling methods, further enhancing the practicality of the method. The apparatus can be described from both a functional module perspective and a hardware perspective. The memory fault handling apparatus provided by this invention is described below. This apparatus is used to implement the memory fault handling method provided by this invention. In this embodiment, the memory fault handling apparatus may include or be divided into one or more program modules. These one or more program modules are stored in a storage medium and executed by one or more processors to complete the memory fault handling method disclosed in the embodiment. The program module referred to in this embodiment refers to a series of computer program instruction segments capable of performing a specific function, which is more suitable than the program itself for describing the execution process of the memory fault handling apparatus in the storage medium. The following description will specifically introduce the functions of each program module in this embodiment. The memory fault handling apparatus described below can be referred to in correspondence with the memory fault handling method described above.

[0092] From the perspective of functional modules, please refer to Figure 7, which is a structural diagram of the memory fault handling device provided in this embodiment under one specific implementation. The device may include: a process building module 701, used to build a first process and a second process that are isolated from each other.

[0093] The adaptive acquisition module 702, while acquiring memory error information using the first process according to the initial data acquisition parameters, listens for the acquisition parameter adjustment trigger signal and adjusts the initial data acquisition parameters accordingly based on the signal type of the acquisition parameter adjustment trigger signal.

[0094] The fault prediction and repair module 703 is used to send memory error information from the first process to the second process, and use the second process to predict memory faults. When the second process predicts a memory fault, it triggers a fault repair operation.

[0095] For example, in some embodiments of this example, the adaptive acquisition module 702 described above can also be used for: Upon detecting an event-driven signal via the out-of-band management bus, the current acquisition task is interrupted, and at least one data acquisition operation is performed on the corresponding hardware at a first preset time. Upon detecting a time-driven signal, the data acquisition period in the initial data acquisition parameters is updated according to the time-driven signal to adjust the scheduling time of the next acquisition operation. Upon detecting a resource status signal, the data acquisition frequency in the initial data acquisition parameters is updated according to the resource status signal.
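The three trigger signals described above can be illustrated with a small dispatch routine. This is a non-limiting Python sketch: the `AcquisitionParams` fields, the signal dictionary layout, and the key names (`type`, `source`, `new_period_s`, `new_frequency_hz`) are hypothetical and chosen only for illustration. The period and the frequency are kept as separate parameters because the text treats them separately.

```python
from dataclasses import dataclass


@dataclass
class AcquisitionParams:
    period_s: float      # data acquisition period
    frequency_hz: float  # data acquisition frequency


def on_signal(params, signal, collect_now):
    """Adjust acquisition parameters according to the type of trigger signal."""
    kind = signal["type"]
    if kind == "event":
        # Event-driven: interrupt the current task and sample the affected hardware.
        collect_now(signal["source"])
    elif kind == "time":
        # Time-driven: a new period reschedules the next acquisition.
        params.period_s = signal["new_period_s"]
    elif kind == "resource":
        # Resource status: the acquisition frequency follows the reported load.
        params.frequency_hz = signal["new_frequency_hz"]
    else:
        raise ValueError(f"unknown signal type: {kind!r}")
    return params


# Example: one signal of each type.
sampled = []
params = AcquisitionParams(period_s=2.0, frequency_hz=0.5)
on_signal(params, {"type": "event", "source": "umc0"}, sampled.append)
on_signal(params, {"type": "time", "new_period_s": 10.0}, sampled.append)
on_signal(params, {"type": "resource", "new_frequency_hz": 0.1}, sampled.append)
```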

[0096] As an exemplary implementation of the above embodiments, the adaptive acquisition module 702 can be further configured to: receive a hardware event signal through an out-of-band interface, generate a hardware event interrupt signal, and when a hardware event interrupt signal is detected, pause the current acquisition task and immediately acquire the memory controller register data of the target hardware corresponding to the hardware event interrupt signal. The hardware event signal is a signal generated by the central processing unit (CPU) based on a hardware trigger signal, conforming to the out-of-band management bus transmission format, and transmitted through the out-of-band management bus.

[0097] As another exemplary implementation of the above embodiments, the adaptive acquisition module 702 can be further used to: send a timing signal indicating that the polling cycle value of the target data source has been modified via an internal bus or interrupt controller, adjust the initial data acquisition parameters according to the new timing value, and determine the next acquisition task scheduled by the target data source according to the new timing value.

[0098] As another exemplary implementation of the above embodiments, the adaptive acquisition module 702 can be further used to: when receiving load status report data through the out-of-band interface, determine the new data acquisition frequency value according to the load-frequency mapping relationship, and update the initial data acquisition parameters accordingly.

[0099] As an exemplary implementation of the above embodiments, the adaptive acquisition module 702 can be further used to: determine the global frequency adjustment coefficient based on the resource status signal and the load-frequency mapping relationship; obtain the initial data acquisition frequency of each data source for the initial data acquisition parameters; adjust the new data acquisition frequency of each data source according to the global frequency adjustment coefficient, and update the initial data acquisition parameters accordingly.
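As a non-limiting illustration of the global frequency adjustment, the following Python sketch looks up a coefficient from a load-frequency mapping and rescales the initial frequency of every data source. The mapping values in `LOAD_BOUNDS` and `COEFFICIENTS`, the use of a load percentage as the resource status signal, and the data source names are assumptions for illustration; the embodiment does not fix any of them.

```python
import bisect

# Hypothetical load-frequency mapping: load up to 30% keeps the full rate,
# up to 70% halves it, and anything above quarters it.
LOAD_BOUNDS = [30, 70, 100]
COEFFICIENTS = [1.0, 0.5, 0.25]


def global_coefficient(load_percent):
    """Return the global frequency adjustment coefficient for a reported load."""
    index = bisect.bisect_left(LOAD_BOUNDS, load_percent)
    return COEFFICIENTS[min(index, len(COEFFICIENTS) - 1)]


def rescale(initial_freq_hz, load_percent):
    """Scale the initial acquisition frequency of every data source by one coefficient."""
    k = global_coefficient(load_percent)
    return {source: freq * k for source, freq in initial_freq_hz.items()}


# Example: two hypothetical data sources under 85% load.
new_freq = rescale({"umc0": 1.0, "umc1": 0.5}, load_percent=85)
```

Scaling from the initial frequencies, rather than from the previously adjusted ones, means repeated high-load reports do not compound and the rates return to their original values when the load drops.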

[0100] For example, in some other embodiments of this example, the above-mentioned device may further include a process processing module, which can be used to: when a restart or update instruction for the first process is received, perform the corresponding operation on the first process while the second process continues its memory fault prediction.

[0101] For example, in some other embodiments of this embodiment, the above-mentioned device may further include a process processing module, which may also be used to: when the second process terminates abnormally, the first process continues to collect memory error information and cache it in the storage area to be synchronized and updated; when the second process is detected to have recovered, the first process synchronizes the memory error information in the storage area to be synchronized and updated to the second process through a data pipeline.

[0102] For example, in some other embodiments of this embodiment, the above-described apparatus may further include a data configuration module, which may be used to: read a data source acquisition strategy table from a target storage location when the first process is initialized; the data source acquisition strategy table includes at least a data source identifier and acquisition configuration parameters; and configure corresponding acquisition parameters for each data source according to the data source acquisition strategy table as initial data acquisition parameters.
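One possible shape for the data source acquisition strategy table is sketched below. The specification only requires a data source identifier and acquisition configuration parameters; the JSON encoding and the field names `source_id`, `period_s`, and `priority` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AcquisitionParams:
    """Initial acquisition parameters for one data source (hypothetical fields)."""
    period_s: float
    priority: int


def load_strategy_table(text: str) -> Dict[str, AcquisitionParams]:
    """Parse a JSON strategy table into per-source initial acquisition parameters."""
    params: Dict[str, AcquisitionParams] = {}
    for row in json.loads(text):
        params[row["source_id"]] = AcquisitionParams(
            period_s=float(row["period_s"]),
            priority=int(row["priority"]),
        )
    return params


# Hypothetical table contents, as they might be read from the target storage location.
EXAMPLE_TABLE = (
    '[{"source_id": "imc0", "period_s": 10, "priority": 1},'
    ' {"source_id": "imc1", "period_s": 60, "priority": 2}]'
)
```

Calling `load_strategy_table(EXAMPLE_TABLE)` once at initialization of the first process would yield a dictionary that serves as the initial data acquisition parameters.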

[0103] For example, in some other embodiments of this embodiment, the fault prediction and repair module 703 described above can also be used to: trigger the generation of a system control interrupt signal, so that the basic input / output system responds to the system control interrupt signal through the system management interrupt handler and obtains the page information to be isolated by interacting with the management controller; the basic input / output system generates error record data containing the translated system physical address and error severity information, and pushes the error record data to the target hardware error source, so that the operating system performs memory page offline operation on the page to be isolated according to the error record data.

[0104] For example, in some other embodiments of this embodiment, the fault prediction and repair module 703 described above can also be used to: determine memory fault location information, the memory fault location information including memory row and column address information; send the fault location information to the system management unit through an out-of-band interface, and have the system management unit forward it to the error processor for performing error handling, diagnosis and recovery operations through the out-of-band path; the error processor translates the memory row and column address information into memory page address information, and generates error record data containing memory page address information and error severity information; push the error record data to the target hardware fault source, so that the operating system performs memory page offline operation on the page to be isolated according to the error record data.

[0105] The memory fault handling device mentioned above is described from the perspective of functional modules. Furthermore, the present invention also provides an electronic device, which is described from a hardware perspective. This electronic device may include a memory and a processor. The memory stores a computer program, and the processor is configured to run the computer program to perform the steps in any of the above-described memory fault handling method embodiments.

[0106] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above-described memory fault handling method embodiments when running.

[0107] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.

[0108] Embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above-described memory fault handling method embodiments.

[0109] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above-described memory fault handling method embodiments.

[0110] Finally, this embodiment also provides an application example of the memory fault handling method in a data center. The data center includes a hybrid cluster of tens of thousands of servers based on AMD and Intel architectures, running services such as database services and data storage. To avoid service interruptions or data corruption caused by memory faults, improve the energy efficiency of the data center, and enhance the reliability and availability of server memory, the memory fault handling system of this invention is deployed on all servers in the data center. Intel architecture servers use the RAS offload process as the first process, while AMD architecture servers use the ADDC process as the first process; both construct a dual-process architecture and establish a data pipeline. When a server starts, the first process loads a preset data source collection strategy table, collects memory error information according to the initial parameters, and simultaneously listens for collection parameter adjustment trigger signals. During service operation, if the server's CPU utilization exceeds 85%, the first process receives a resource status signal, calculates a global frequency adjustment coefficient of 0.5, and halves the collection frequency of all data sources so that data collection does not compete with services for resources. If the number of memory errors on an AMD architecture server reaches a preset empirical threshold, a hardware event interrupt signal is triggered. The first process immediately suspends regular collection, selectively collects error data from the affected memory module, and sends it to the second process through the data pipeline. The second process's analysis reveals a risk of uncorrectable errors in the memory module and immediately triggers a fault repair operation: fault page isolation is performed via MPRAS, and the faulty page is marked as offline. Maintenance personnel receive a fault alert through the data center management platform and replace the memory module during off-peak hours (e.g., early morning). In the meantime, the server operates normally and business operations are not affected. If the second process on an Intel architecture server crashes due to an unknown error, the first process continues to run independently and caches the memory error information in the storage area to be synchronized and updated. After maintenance personnel restart the second process, the cached data is synchronized, and the fault prediction function returns to normal.
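The way the first process reacts differently to each type of trigger signal in this example can be sketched as a small dispatcher. The `Collector` structure, the dictionary-based signal format, and the signal type names are hypothetical; the 85% threshold and the coefficient of 0.5 come from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Collector:
    """Acquisition state held by the first process (hypothetical structure)."""
    initial_hz: Dict[str, float]
    current_hz: Dict[str, float] = field(default_factory=dict)
    paused: bool = False          # True while regular collection is suspended
    target: Optional[str] = None  # data source selected for targeted collection

    def __post_init__(self) -> None:
        if not self.current_hz:
            self.current_hz = dict(self.initial_hz)


def handle_signal(c: Collector, signal: dict) -> None:
    """Adjust acquisition parameters according to the signal type."""
    kind = signal["type"]
    if kind == "resource_status":
        # Halve every source's frequency, relative to its initial value, under high load.
        k = 0.5 if signal["cpu_util_pct"] > 85 else 1.0
        c.current_hz = {s: hz * k for s, hz in c.initial_hz.items()}
    elif kind == "hardware_event":
        # Suspend regular collection and focus on the affected source.
        c.paused = True
        c.target = signal["source_id"]
    elif kind == "time_driven":
        # A new polling period for one source replaces its current frequency.
        c.current_hz[signal["source_id"]] = 1.0 / signal["period_s"]
    else:
        raise ValueError(f"unknown signal type: {kind}")
```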

[0111] As can be seen from the above, on the Intel platform this invention can optimize the MRT architecture by employing a dual-process communication method on top of the existing out-of-band autonomous acquisition framework of the RAS offload module. It uses a dynamic, adaptive sampling method to completely eliminate high-frequency interference with the CPU and memory subsystems, and significantly reduces the storage and network bandwidth consumed on the BMC side. This both lowers development costs and reduces the impact of frequent data acquisition on out-of-band processor performance. On the AMD platform, this invention can likewise implement Intel MRT-like functions on top of the existing ADDC out-of-band autonomous acquisition framework by employing the same dual-process communication method and dynamic, adaptive sampling method, with the same benefits.

[0112] Any of the components, modules, units, parts, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Alternatively or additionally, any functionality described herein can be performed at least in part by one or more hardware logic components, such as, but not limited to, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system-on-a-chip (SoC), a complex programmable logic device (CPLD), a microcontroller unit (MCU), etc. The systems, computer devices, or apparatuses described herein encompass a wide range of means, devices, and machines for processing data, including, for example, one or more programmable processors, computers, SoCs, or combinations thereof. The apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or one or more combinations thereof. The aforementioned computer program (also known as a program, software, software application, app, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for a computing environment.

[0113] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed in this specification can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0114] The foregoing has provided a detailed description of a memory fault handling system, method, electronic device, and computer program product provided by the present invention. The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. Several improvements and modifications can be made to the present invention without departing from its principles, and these improvements and modifications also fall within the protection scope of the present invention.

Claims

1. A memory fault handling system, characterized in that it comprises a memory, a management controller, and a central processing unit; the central processing unit is connected to the memory via a physical link and to the management controller via an out-of-band management bus; the memory stores at least system operating data; the central processing unit is configured to read the system operating data through the physical link and determine memory error information from the system operating data; the management controller is configured to: construct a first process and a second process that are isolated from each other; while the first process collects memory error information through the out-of-band interface according to the initial data acquisition parameters, listen for an acquisition parameter adjustment trigger signal, and adjust the initial data acquisition parameters according to the signal type of the acquisition parameter adjustment trigger signal; the second process receives the memory error information sent by the first process, predicts a memory failure based on the memory error information, and triggers a fault repair operation.

2. The memory fault handling system according to claim 1, characterized in that the management controller is further configured to: upon detecting an event-driven signal via the out-of-band management bus, interrupt the current acquisition task and perform at least one data acquisition operation on the corresponding hardware at a first preset time, wherein the difference between the first preset time and the time at which the event-driven signal was detected is a preset minimum response time.

3. The memory fault handling system according to claim 2, characterized in that the central processing unit is further configured to: when a hardware trigger signal is generated, generate a hardware event signal conforming to the out-of-band management bus transmission format according to the hardware trigger signal, and send the hardware event signal to the management controller through the out-of-band management bus; the management controller is further configured to: receive the hardware event signal through an out-of-band interface, generate a hardware event interrupt signal, and, when the hardware event interrupt signal is detected, pause the current acquisition task and immediately acquire the memory controller register data of the target hardware corresponding to the hardware event interrupt signal.

4. The memory fault handling system according to claim 1, characterized in that the central processing unit includes at least a memory controller; the memory controller is connected to the memory slot via pins and is configured to: read the system operating data from the memory, perform error detection and correction processing on the system operating data, and update the detected memory error information to the corresponding register.

5. The memory fault handling system according to claim 4, characterized in that the central processing unit further includes at least a hardware debugger; the hardware debugger is connected to the memory controller via an internal interconnect line or an error reporting bus of the central processing unit; the hardware debugger is configured to: when an error signal sent by the memory controller through the error reporting bus is detected, read the register data of the memory controller, or actively read the register data of the memory controller through the internal interconnect line; the threshold register of the hardware debugger is configured such that, if a memory error in the register data meets a preset threshold condition, a hardware trigger signal is generated, a hardware event signal conforming to the out-of-band management bus transmission format is generated based on the hardware trigger signal, and the hardware event signal is sent to the management controller via the out-of-band management bus.

6. The memory fault handling system according to claim 1, characterized in that the management controller is further configured to: when a time-driven signal is detected, update the data acquisition period in the initial data acquisition parameters according to the time-driven signal, so as to adjust the scheduling time of the next acquisition operation.

7. The memory fault handling system according to claim 6, characterized in that the management controller integrates at least one timer; the timer is configured to send, via the management controller's internal bus or interrupt controller, a timing signal indicating that the polling cycle value of the target data source has been modified; the management controller is further configured to: adjust the initial data acquisition parameters according to the new timing value, and schedule the next acquisition task of the target data source according to the new timing value.

8. The memory fault handling system according to claim 1, characterized in that the management controller is further configured to: when a resource status signal is detected, update the data acquisition frequency in the initial data acquisition parameters based on the resource status signal.

9. The memory fault handling system according to claim 8, characterized in that the performance monitoring unit of the central processing unit is configured to send load status report data to the management controller via the out-of-band management bus; the management controller is further configured to: when receiving the load status report data through the out-of-band interface, determine a new data acquisition frequency value based on the load-frequency mapping relationship, and update the initial data acquisition parameters accordingly.

10. The memory fault handling system according to claim 8, characterized in that the management controller is further configured to: determine the global frequency adjustment coefficient based on the resource status signal and the load-frequency mapping relationship; obtain the initial data acquisition frequency of each data source from the initial data acquisition parameters; determine the new data acquisition frequency of each data source according to the global frequency adjustment coefficient, and update the initial data acquisition parameters accordingly.

11. The memory fault handling system according to claim 1, characterized in that the management controller is connected to a non-volatile memory via a hardware interaction bus; a target storage location of the non-volatile memory stores a data source acquisition strategy table, which includes at least a data source identifier and acquisition configuration parameters; the management controller is further configured to: during the initialization process, read the data source acquisition strategy table from the non-volatile memory, and configure corresponding acquisition parameters for each data source according to the data source acquisition strategy table as the initial data acquisition parameters.

12. The memory fault handling system according to any one of claims 1 to 11, characterized in that the management controller is further configured to: when a request to restart or update the first process is received, perform the corresponding operation on the first process while the second process continues to perform memory fault prediction.

13. The memory fault handling system according to any one of claims 1 to 11, characterized in that the management controller further includes a storage area to be synchronized and updated, and the first process and the second process communicate with each other through a data pipeline; the management controller is further configured such that: when the second process terminates abnormally, the first process continues to collect memory error information and caches it in the storage area to be synchronized and updated; when the second process is detected to have recovered, the first process synchronizes the memory error information in the storage area to be synchronized and updated to the second process through the data pipeline.

14. The memory fault handling system according to any one of claims 1 to 11, characterized in that the management controller is further configured to: when the fault repair operation triggered by the second process is a fault page isolation operation, trigger the generation of a system control interrupt signal and send the system control interrupt signal to the central processing unit; the central processing unit runs an operating system and a basic input / output system; the basic input / output system is configured to: respond to the system control interrupt signal through a system management interrupt handler, obtain the page information to be isolated through the out-of-band management bus, generate error record data containing the translated system physical address and error severity information, and push the error record data to the target hardware error source; the operating system performs a memory page offline operation on the page to be isolated based on the error record data.

15. The memory fault handling system according to any one of claims 1 to 11, characterized in that it further comprises an error processor connected to the central processing unit and performing error handling, diagnosis, and recovery operations via an out-of-band path; the management controller is further configured to: when the fault repair operation triggered by the second process is a fault page isolation operation, determine the memory fault location information, and send the memory fault location information to the system management unit of the central processing unit through an out-of-band interface; the memory fault location information includes memory row and column address information; the error processor is configured to: receive the memory fault location information forwarded by the system management unit, translate the memory row and column address information into memory page address information, generate error record data containing the memory page address information and error severity information, write the error record data into a reserved error block buffer, and trigger the generation of a system interrupt signal; when the operating system running on the central processing unit receives the system interrupt signal, it calls the target hardware error source driver to read and parse the error record data from the error block buffer, and then performs a memory page offline operation on the page to be isolated based on the error record data.

16. The memory fault handling system according to any one of claims 1 to 11, characterized in that it further comprises a coprocessor located in the central processing unit; the management controller is further configured to: when the fault repair operation triggered by the second process is a runtime post-package repair (PPR) operation, generate a post-package repair operation request containing the fault physical address, and send the post-package repair operation request to the system management unit of the central processing unit through the out-of-band management bus; the system management unit sends the post-package repair operation request to the coprocessor, and the coprocessor executes the post-package repair operation request in a trusted execution environment.

17. The memory fault handling system according to any one of claims 1 to 11, characterized in that the management controller is further configured to: configure, through the out-of-band management bus, the monitoring parameters of the autonomous debugging data collection hardware module built into the central processing unit; when the memory controller of the central processing unit detects memory error information, the autonomous debugging data collection hardware module automatically captures the memory error information; the first process obtains the memory error information from the autonomous debugging data collection hardware module through the out-of-band management bus according to the initial data acquisition parameters.

18. A memory fault handling method, characterized in that it is applied to a management controller and comprises: constructing a first process and a second process that are isolated from each other; while the first process collects memory error information according to the initial data acquisition parameters, listening for an acquisition parameter adjustment trigger signal, and adjusting the initial data acquisition parameters according to the signal type of the acquisition parameter adjustment trigger signal; sending, by the first process, the memory error information to the second process; and predicting, by the second process, memory failures and, when a memory failure is predicted, triggering, by the second process, a fault repair operation.

19. An electronic device, characterized in that it comprises: a memory, used to store a computer program; and a processor, configured to implement the steps of the memory fault handling method as described in claim 18 when executing the computer program.

20. A computer program product comprising a computer program / instructions, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the memory fault handling method as described in claim 18.