A method, system, and electronic device for detecting server memory faults.

By monitoring interrupt signals of the server system through the baseboard management controller, parsing memory unit attribute information and generating early warnings, the problem of not being able to detect memory faults in advance in existing technologies is solved, thus ensuring server stability.

CN116643943B · Active · Publication Date: 2026-06-30 · INSPUR SUZHOU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INSPUR SUZHOU INTELLIGENT TECH CO LTD
Filing Date
2023-05-26
Publication Date
2026-06-30


Abstract

This specification provides a server memory fault detection method, system, and electronic device, capable of early detection and warning of memory faults, ensuring the stability of server system operation. The method includes: monitoring a target server; when the target server triggers a system interrupt signal, collecting basic error information of the memory units corresponding to the system interrupt signal; parsing the basic error information to determine the corresponding memory unit attribute information stored in an internal database; querying and detecting the data content in the internal database to determine the information frequency of the memory unit attribute information corresponding to multiple memory units within a preset time period; determining whether a memory fault exists in the corresponding memory unit based on the information frequency; and when a memory fault is determined to exist in the memory unit, generating corresponding alarm information based on the memory unit attribute information and reporting it.

Description

Technical Field

[0001] This specification relates to the field of computer memory technology, specifically to a server memory fault detection method, system, and electronic device.

Background Technology

[0002] Modern computer systems are becoming increasingly complex, and various application scenarios require a large number of backend servers for support, placing higher demands on server stability. Memory failures are a key factor affecting server stability. Accurately diagnosing memory failures, recording memory alarm information, and reducing the probability of memory-triggered crashes have become core technical challenges in the industry and a focus of customer attention. Related technologies primarily focus on fault diagnosis and repair after system crashes, but they cannot provide early detection and warnings for potential memory failures, thus failing to guarantee server system stability.

Summary of the Invention

[0003] In view of this, the embodiments of this specification provide a server memory fault detection method, system and electronic device, which can detect and warn of memory faults in advance, and can ensure the stability of server system operation.

[0004] In a first aspect, embodiments of this specification provide a server memory fault detection method, applied to a baseboard management controller. The method includes:

[0005] The system monitors the target server and collects basic error information of the memory unit corresponding to the system interrupt signal when the target server triggers a system interrupt signal.

[0006] The basic error information is parsed to determine the corresponding memory unit attribute information, and the memory unit attribute information is stored in an internal database.

[0007] The data content in the internal database is queried and detected to determine the information frequency of the memory unit attribute information corresponding to multiple memory units within a preset time period;

[0008] Based on the frequency of the information, it is determined whether the corresponding memory unit has a memory fault. When it is determined that the memory unit has a memory fault, corresponding alarm information is generated and reported according to the memory unit attribute information.

[0009] This specification also provides a server memory fault detection system applied to a baseboard management controller. The system includes:

[0010] The system signal monitoring module is used to monitor the target server. When the target server triggers a system interrupt signal, it collects basic error information of the memory unit corresponding to the system interrupt signal.

[0011] The memory attribute determination module is used to parse the basic error information to determine the corresponding memory unit attribute information and store the memory unit attribute information in an internal database.

[0012] The database query module is used to query and detect the data content in the internal database, and determine the information frequency of the memory unit attribute information corresponding to multiple memory units within a preset time period; and

[0013] The fault determination module is used to determine whether the corresponding memory unit has a memory fault based on the frequency of the information. When it is determined that the memory unit has a memory fault, the module generates corresponding alarm information based on the memory unit attribute information and reports it.

[0014] This specification also provides an electronic device for detecting server memory faults, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the server memory fault detection method as described in the first aspect.

[0015] As can be seen from the above, the server memory fault detection method, system, and electronic device provided in the embodiments of this specification have the following beneficial technical effects:

[0016] The described server memory fault detection method uses the BMC to monitor system interrupt signals of the server system, parses the system interrupt signals to determine the memory unit attribute information of the corresponding memory unit, and diagnoses whether a memory fault exists based on the frequency of the memory unit attribute information. This method can determine memory unit faults quickly and accurately and promptly generate alarm information for notification when a memory unit fault is determined, so that staff can carry out recovery processing. It can detect memory faults in advance and provide early warning, effectively avoiding server downtime and ensuring the stability of server system operation.

Attached Figure Description

[0017] The features and advantages of the invention will be more clearly understood by referring to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:

[0018] Figure 1 illustrates a server memory fault detection method provided by one or more optional embodiments of this specification.

[0019] Figure 2 illustrates a method for determining memory unit attribute information in a server memory fault detection method provided by one or more optional embodiments of this specification.

[0020] Figure 3 illustrates the data item structure of the internal database in a server memory fault detection method provided by one or more optional embodiments of this specification.

[0021] Figure 4 illustrates a method for updating the internal database when sending a command message to the internal database in a server memory fault detection method provided by one or more optional embodiments of this specification.

[0022] Figure 5 illustrates a method for performing memory isolation in a server memory fault detection method provided by one or more optional embodiments of this specification.

[0023] Figure 6 is a schematic diagram of the structure of a server memory fault detection system provided by one or more optional embodiments of this specification.

[0024] Figure 7 is a schematic diagram of the structure of an electronic device for server memory fault detection provided by one or more optional embodiments of this specification.

Detailed Implementation

[0025] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0026] Modern computer systems are becoming increasingly complex, and various application scenarios require a large number of backend servers for support, placing higher demands on server stability. Memory failures are a key factor affecting server stability. Accurately diagnosing memory failures, recording memory alarm information, and reducing the probability of memory-triggered crashes have become core technical challenges in the industry and a focus of customer attention. Related technologies primarily focus on fault diagnosis and repair after system crashes, but they cannot provide early detection and warnings for potential memory failures, thus failing to guarantee server system stability.

[0027] To address the aforementioned issues, this specification proposes a server memory fault detection method, system, and electronic device. A Baseboard Management Controller (BMC) monitors system interrupt signals of the server system, determines the relevant attribute information of the memory unit corresponding to each system interrupt signal, and performs fault diagnosis and detection based on that attribute information. Memory faults can thus be detected in advance to provide early warning, avoid server downtime, and ensure the stability of server system operation.

[0028] To achieve the above objectives, one aspect of this specification provides a method for detecting server memory faults.

[0029] As shown in Figure 1, one or more optional embodiments of this specification provide a server memory fault detection method applied to a baseboard management controller (BMC). The method includes:

[0030] S1: Monitor the target server. When the target server triggers a system interrupt signal, collect basic error information of the memory unit corresponding to the system interrupt signal.

[0031] During server operation, an interrupt signal is triggered when an abnormal situation occurs in the server memory. Various system signals of the target server can be monitored during its operation to intercept system interrupt signals triggered during system operation, thereby identifying memory units where abnormal situations may exist based on these interrupt signals.

[0032] In some alternative implementations, the following methods can be used to monitor the target server:

[0033] The baseboard management controller (BMC) is communicatively connected to the basic input/output system (BIOS) in the target server. The BIOS receives various system signals related to the target server and performs real-time monitoring and identification of these system signals.

[0034] The system signals are identified to determine if a system interrupt signal is present. When an abnormality occurs in the server memory, a system interrupt signal, namely the System Management Interrupt (SMI) signal, will be triggered during the operation of the target server system.

[0035] When the system interrupt signal SMI exists in the target server, the SMI signal is extracted for further analysis.

[0036] By parsing the system interrupt signal, basic error information of the corresponding memory unit can be obtained. In some optional embodiments, the following methods can be used to collect the basic error information of the memory unit corresponding to the system interrupt signal:

[0037] The interrupt handler function that triggers the system interrupt signal SMI in the target server system is identified. Tracing the SMI back through this interrupt handler function determines the source of the system interrupt, namely the memory unit where an anomaly may exist. At the same time, the coordinates and system physical address of the memory unit can also be determined.

[0038] The basic error information includes the system interrupt signal and the coordinate information and system physical address of the corresponding memory unit.
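The basic error information described in step S1 can be pictured as a small record. The following is a minimal Python sketch, not the patented implementation: the event dictionary, its keys (`signal`, `socket`, `channel`, `slot`, `spa`), and the three-part coordinate are assumptions made for illustration, since the specification does not define how the decoded SMI is represented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BasicErrorInfo:
    """Basic error information for one memory unit (field names are hypothetical)."""
    signal: str                        # the system interrupt signal, e.g. "SMI"
    coordinate: Tuple[int, int, int]   # assumed (socket, channel, slot) coordinate
    system_physical_address: int


def collect_basic_error_info(event: dict) -> Optional[BasicErrorInfo]:
    """Return basic error info for an SMI event, or None for any other signal.

    `event` stands in for an already decoded system signal; its layout is an
    assumption of this sketch.
    """
    if event.get("signal") != "SMI":
        return None
    return BasicErrorInfo(
        signal="SMI",
        coordinate=(event["socket"], event["channel"], event["slot"]),
        system_physical_address=event["spa"],
    )
```

Signals other than SMI are ignored here, which mirrors the identification step in which only the system interrupt signal is extracted for further analysis.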

[0039] S2: Analyze the basic error information to determine the corresponding memory unit attribute information, and store the memory unit attribute information in the internal database.

[0040] For memory units that may have abnormal conditions, further analysis can be performed based on their corresponding basic error information to determine more comprehensive related attribute information.

[0041] As shown in Figure 2, in a server memory fault detection method provided by one or more optional embodiments of this specification, parsing the basic error information to determine the corresponding memory unit attribute information includes:

[0042] S201: Based on the coordinate information and the corresponding local hardware configuration information of the target server, query and determine the memory module to which the memory unit belongs in the target server and the positioning row and column information in the memory module.

[0043] The local hardware configuration information of the target server is obtained. This local hardware configuration information records the specific hardware attribute information of the memory device modules in the server. The hardware attribute information may include the number of memory modules configured in the memory device module, the corresponding firmware parameters, the manufacturer, the corresponding installation location, the storage unit row and column information of the memory sticks in each memory module, silkscreen information, etc.

[0044] The local hardware configuration information can be stored on the server in the form of an asset information list, and the local hardware configuration information can be obtained through query commands.

[0045] Based on the local hardware configuration information, the memory module to which the memory unit belongs in the target server can be determined according to the coordinate information, as well as the specific row and column information within the memory module. This coordinate information helps staff accurately locate specific memory units.

[0046] S202: Translate the system physical address to generate the corresponding memory logical address of the memory unit in the target server.

[0047] During server system operation, data transmission and read/write operations use the logical addresses of the memory units. The system physical address can be translated into the corresponding memory logical address based on the server system architecture information. The memory logical address makes it easier for the system to direct instructions to the appropriate memory unit.

[0048] The memory unit attribute information includes the memory module corresponding to the memory unit, the positioning row and column information in the memory module, and the corresponding memory logical address.
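Steps S201 and S202 can be illustrated with a toy lookup. The sketch below is written in Python under strong simplifying assumptions: the hardware configuration is a dictionary keyed by a hypothetical (socket, channel, slot) coordinate, each module covers one contiguous physical range, and the "logical address" is simply the offset within that range. Real address decoding depends on the platform's memory controller and interleaving settings and is not specified in this document.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Coordinate = Tuple[int, int, int]  # assumed (socket, channel, slot)


@dataclass(frozen=True)
class ModuleConfig:
    """One entry of the local hardware configuration (fields are hypothetical)."""
    silkscreen: str     # board label of the memory module
    base_address: int   # assumed start of the module's physical address range
    row_bits: int
    column_bits: int


@dataclass(frozen=True)
class MemoryUnitAttributes:
    module: str
    row: int
    column: int
    logical_address: int


def resolve_attributes(coord: Coordinate, physical_address: int,
                       config: Dict[Coordinate, ModuleConfig]) -> MemoryUnitAttributes:
    """S201: look up the module and row/column; S202: derive a logical address."""
    module = config[coord]
    offset = physical_address - module.base_address
    if offset < 0:
        raise ValueError("address lies below the module's assumed base address")
    # Toy decoding: low bits select the column, the next bits select the row.
    column = offset & ((1 << module.column_bits) - 1)
    row = (offset >> module.column_bits) & ((1 << module.row_bits) - 1)
    return MemoryUnitAttributes(module.silkscreen, row, column, offset)
```

The silkscreen label is carried through because the specification lists it among the hardware attributes that help staff locate a module.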

[0049] After determining the memory attribute information corresponding to the memory unit, the memory attribute information can be stored in a dedicated internal database. The internal database is used to maintain memory unit information for the entire server system and can include multiple data items, each recording relevant information for one memory unit.

[0050] In some optional implementations, the memory unit attribute information can be stored in an internal database by: formatting the memory unit attribute information to form a uniformly formatted command message, and sending the command message to the internal database to generate corresponding data items. The command message can be an IPMI command message.

[0051] As shown in Figure 3, in a server memory fault detection method provided by one or more optional embodiments of this specification, the data items include memory unit coordinates, error flags, error counts, physical memory addresses, and logical memory addresses.

[0052] The memory cell coordinates are used to record the row and column information of the corresponding memory cell, and the row and column information is used to indicate the specific physical location of the memory cell.

[0053] The error flag is used to indicate whether a memory error has occurred in the corresponding memory unit, causing a system interrupt signal to be triggered. In Figure 3, the "CE (Correctable Error) flag" is the error flag item.

[0054] The error count item represents the number of times that the corresponding memory unit has triggered a memory error resulting in a system interrupt signal. In Figure 3, the "CE (Correctable Error) Quantity" is the error count item.

[0055] The physical memory address is used to represent the actual physical address information of the corresponding memory unit in the target server.

[0056] The memory logical address is used to represent the specific logical address information of the corresponding memory unit in the target server system.

[0057] The data items in the internal database can record detailed information about the corresponding memory units.

[0058] As shown in Figure 4, in a server memory fault detection method provided by one or more optional embodiments of this specification, sending the command message to the internal database further includes:

[0059] S401: Determine whether an associated data item corresponding to the command message already exists in the internal database. The associated data item refers to the data item that corresponds to the same memory unit as the command message.

[0060] S402: In response to the existence of the associated data item in the internal database, update the associated data item according to the command message.

[0061] Updating the associated data item mainly involves updating the specific value of the Error Count (CE) item within that associated data item. Taking memory unit A as an example, during server operation, when memory unit A triggers a system interrupt signal for the first time, the fault detection method writes a data item corresponding to memory unit A into the internal database, where the CE count is recorded as 1. When memory unit A triggers a system interrupt signal again, the associated data item already exists in the internal database, and the CE count in the associated data item is updated to 2.

[0062] S403: In response to the absence of the associated data item in the internal database, a new data item is generated based on the data content in the command message.

[0063] If the associated data item does not exist in the internal database, the corresponding memory unit is likely triggering a system interrupt signal for the first time within the preset time period. In this case, a new data item is generated in the internal database, which records the various information of the corresponding memory unit.
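Steps S401 to S403 amount to an update-or-insert on the internal database. The Python sketch below models the database as an in-memory dictionary, which is an assumption; the specification does not say how the database is implemented. The message keys and the `first_seen` timestamp are also hypothetical additions of this sketch (the timestamp is not one of the data item fields listed for Figure 3).

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[str, int, int]  # assumed identity of a memory unit: (module, row, column)


@dataclass
class DataItem:
    """One data item, following the fields listed for Figure 3."""
    coordinates: Key
    ce_flag: bool
    ce_count: int
    physical_address: int
    logical_address: int
    first_seen: float  # hypothetical extra field, useful for a time window


def upsert(db: Dict[Key, DataItem], message: dict,
           now: Optional[float] = None) -> DataItem:
    """Apply one command message to the database (S401 to S403)."""
    now = time.time() if now is None else now
    key = (message["module"], message["row"], message["column"])
    item = db.get(key)                # S401: does an associated data item exist?
    if item is not None:              # S402: update the CE count of the existing item
        item.ce_count += 1
        item.ce_flag = True
        return item
    item = DataItem(key, True, 1,     # S403: create a new data item with count 1
                    message["physical_address"], message["logical_address"], now)
    db[key] = item
    return item
```

Applying the same message twice leaves a single data item whose CE count is 2, matching the memory unit A example in paragraph [0061].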

[0064] S3: Query and detect the data content in the internal database to determine the information frequency of the memory unit attribute information corresponding to multiple memory units within a preset time period.

[0065] The frequency of the memory unit attribute information generated corresponding to the memory unit within a preset time period can be determined based on the number of CEs recorded in the data items of the internal database. This information frequency characterizes the number of times the memory unit triggers a system interrupt signal within the preset time period, that is, the number of times the memory unit may experience abnormal conditions within the corresponding time period. It is understood that the number of times a memory unit may experience abnormal conditions can, to some extent, characterize the risk of a memory failure leading to server system downtime; a higher number indicates a higher risk of failure.

[0066] Based on the above analysis, in some optional embodiments, the following method can be used to determine whether the corresponding memory unit has a memory fault: compare the information frequency with a preset error frequency threshold; in response to the information frequency exceeding the preset error frequency threshold, determine that the corresponding memory unit has a memory fault.

[0067] The information frequency, i.e., the number of times an anomaly occurs in the corresponding memory unit, can be used to measure and characterize the risk of memory failure. It is compared with the preset error frequency threshold; if it exceeds the threshold, the memory unit is determined to have a memory fault that could cause the server system to crash. The preset error frequency threshold can be flexibly set and adjusted according to actual conditions; generally, it can be set to 2 or 3. That is, if a memory unit experiences two or more anomalies within a preset time period, it can be determined that the memory unit is faulty and poses a significant risk. The preset time period can also be set and adjusted according to actual conditions; for example, it can be set to 24 hours or 12 hours.
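The threshold comparison can be sketched in a few lines of Python. This is only an illustration: the mapping from unit identifiers to CE counts is hypothetical, and the strict comparison follows the word "exceeding" in paragraph [0066]. Because the text also speaks of "two or more" anomalies, whether the comparison should be strict or inclusive is a design decision that this sketch does not settle.

```python
from typing import Dict, List


def find_faulty_units(ce_counts: Dict[str, int], threshold: int = 2) -> List[str]:
    """Return the units whose CE count within the window exceeds the threshold.

    `ce_counts` maps a hypothetical unit identifier to its recorded CE count.
    """
    return sorted(unit for unit, count in ce_counts.items() if count > threshold)
```

With counts of 3, 2, and 1 for units A, B, and C and a threshold of 2, only unit A is reported under the strict comparison.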

[0068] S4: Based on the frequency of the information, determine whether the corresponding memory unit has a memory fault. When it is determined that the memory unit has a memory fault, generate corresponding alarm information according to the memory unit attribute information and report it.

[0069] When a memory fault is detected in the memory unit, an alarm should be issued in a timely manner. An alarm message should be generated based on the corresponding memory unit attribute information and reported to the relevant personnel so that they can take appropriate action and eliminate the memory fault.

[0070] In some alternative implementations, the following methods can be used to generate the alarm information:

[0071] Based on the memory cell attribute information, the memory module corresponding to the faulty memory cell and the location row and column information of the memory cell within that memory module are determined. When a fault is determined in a memory cell, relevant memory coordinate information can also be extracted from the data item corresponding to the memory cell in the internal database. This includes the memory module to which the memory cell belongs and its location row and column information within the corresponding memory module. The alarm information includes the memory module, the corresponding location row and column information, a memory fault alarm prompt, and the corresponding prompt time information.

[0072] The memory module and corresponding row and column positioning information in the alarm information help staff quickly locate the actual physical location of the faulty memory unit, perform fault recovery processing, or replace the faulty memory unit or memory module, thereby ensuring the safe operation of the server system, avoiding server system downtime due to memory failure, and improving the stability of server system operation.
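The content of the alarm information listed in paragraph [0071] can be sketched as a simple record. In the Python example below, the dictionary keys, the prompt wording, and the ISO 8601 timestamp format are assumptions; the specification names the four pieces of information but not their encoding or the reporting channel.

```python
from datetime import datetime, timezone
from typing import Optional


def build_alarm(module: str, row: int, column: int,
                when: Optional[datetime] = None) -> dict:
    """Assemble alarm information for one faulty memory unit."""
    when = when or datetime.now(timezone.utc)
    return {
        "module": module,                   # memory module the unit belongs to
        "row": row,                         # positioning row within the module
        "column": column,                   # positioning column within the module
        "prompt": "Memory fault detected",  # hypothetical alarm prompt text
        "time": when.isoformat(),           # prompt time information
    }
```

Passing an explicit `when` keeps the function deterministic, which is convenient when alarm generation is exercised in tests.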

[0073] The described server memory fault detection method uses the BMC to monitor system interrupt signals of the server system, parses the system interrupt signals to determine the memory unit attribute information of the corresponding memory unit, and diagnoses whether a memory fault exists based on the frequency of the memory unit attribute information. This method can determine memory unit faults quickly and accurately and promptly generate alarm information for notification when a memory unit fault is determined, so that staff can carry out recovery processing. It can detect memory faults in advance and provide early warning, effectively avoiding server downtime and ensuring the stability of server system operation.

[0074] As shown in Figure 5, in a server memory fault detection method provided by one or more optional embodiments of this specification, after determining that a memory unit has a memory fault, the method further includes:

[0075] S501: Determine the physical and logical memory addresses of the memory cells where memory faults exist.

[0076] The physical memory address and the logical memory address can be determined based on the corresponding memory unit attribute information, or they can be extracted from the data item corresponding to the memory unit in the internal database.

[0077] S502: Based on the physical memory address and the logical memory address, perform a memory isolation operation on the memory unit so that the target server ignores the memory unit.

[0078] The BIOS system's memory diagnostic interface can be invoked to perform memory isolation operations on faulty memory cells based on their physical and logical addresses. After memory isolation, the target server system will not access the faulty memory cell during operation, thus preventing system crashes caused by the memory cell's failure. Memory isolation can be performed simultaneously with alarm message generation. Memory isolation is achieved before staff receive the alarm and take appropriate action, effectively preventing memory failures from impacting the overall server system.

[0079] The server memory fault detection method provided in one or more optional embodiments of this specification further includes:

[0080] If the frequency of the information corresponding to the memory cell does not exceed the preset error frequency threshold within a preset time period, it is determined that the corresponding memory cell is not faulty.

[0081] The data content of the memory units that do not have faults is cleared from the internal database.

[0082] The maintenance of data content in the internal database is time-sensitive, and its corresponding maintenance duration is the preset time period. Within the preset time period, if the information frequency corresponding to the memory unit does not exceed the preset error frequency threshold, it indicates that although the memory unit experienced an abnormal situation, the system interrupt signal may have been triggered for other reasons (a false alarm), or the memory unit has already returned to normal. The unit can be classified as a normal memory unit and considered to be without fault. Therefore, the data content corresponding to the memory unit is cleared from the internal database, and the internal database is updated.
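The time-limited maintenance of the internal database can be sketched as a periodic purge. The Python example below assumes that each entry stores its CE count together with the time the unit first triggered an interrupt; that timestamp, the dictionary representation, and the default 24-hour window and threshold of 2 are illustrative assumptions drawn from the example values in the text, not a prescribed design.

```python
from typing import Dict, Tuple

Entry = Tuple[int, float]  # (ce_count, first_seen as seconds since some epoch)


def purge_expired(db: Dict[str, Entry], now: float,
                  window_seconds: float = 24 * 3600,
                  threshold: int = 2) -> Dict[str, Entry]:
    """Drop units whose window has elapsed without the count exceeding the threshold.

    Units still inside their window, and units already over the threshold,
    are kept.
    """
    return {
        unit: (count, first_seen)
        for unit, (count, first_seen) in db.items()
        if count > threshold or now - first_seen < window_seconds
    }
```

Returning a new dictionary rather than deleting in place avoids mutating the database while iterating over it.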

[0083] It should be noted that the methods of one or more embodiments of this specification can be executed by a single device, such as a computer or server. The methods of this embodiment can also be applied in a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the methods of one or more embodiments of this specification, and the multiple devices will interact with each other to complete the method described.

[0084] It should be noted that the above description describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims may be performed in a different order than that shown in the embodiments and still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0085] Based on the same inventive concept, and corresponding to any of the above embodiments, this specification also provides a server memory fault detection system, which is applied to a baseboard management controller.

[0086] Referring to Figure 6, the server memory fault detection system includes:

[0087] The system signal monitoring module is used to monitor the target server. When the target server triggers a system interrupt signal, it collects basic error information of the memory unit corresponding to the system interrupt signal.

[0088] The memory attribute determination module is used to parse the basic error information to determine the corresponding memory unit attribute information and store the memory unit attribute information in an internal database.

[0089] The database query module is used to query and detect the data content in the internal database, and determine the information frequency of the memory unit attribute information corresponding to multiple memory units within a preset time period; and

[0090] The fault determination module is used to determine whether the corresponding memory unit has a memory fault based on the frequency of the information. When it is determined that the memory unit has a memory fault, the module generates corresponding alarm information based on the memory unit attribute information and reports it.

[0091] In a server memory fault detection system provided in one or more optional embodiments of this specification, the system signal monitoring module is further configured to receive system signals of the target server using a basic input/output system; identify the system signals to determine whether a system interrupt signal exists therein; and if the system interrupt signal exists, extract the system interrupt signal from it.

[0092] In a server memory fault detection system provided in one or more optional embodiments of this specification, the system signal monitoring module is further configured to parse the system interrupt signal to determine the coordinate information and system physical address of the memory unit corresponding to the system interrupt signal. The basic error information includes the system interrupt signal and the coordinate information and system physical address of the corresponding memory unit.

[0093] In a server memory fault detection system provided in one or more optional embodiments of this specification, the basic error information includes the coordinate information and system physical address of the corresponding memory unit. The memory attribute determination module is further configured to query and determine the memory module to which the memory unit belongs in the target server and its location row and column information within the memory module, based on the coordinate information and the corresponding local hardware configuration information of the target server; translate the system physical address to generate the corresponding memory logical address of the memory unit in the target server; the memory unit attribute information includes the memory module corresponding to the memory unit, the location row and column information within the memory module, and the corresponding memory logical address.

[0094] In a server memory fault detection system provided in one or more optional embodiments of this specification, the internal database includes multiple data items corresponding to multiple memory units. The memory attribute determination module is further configured to format the memory unit attribute information to form a command message, and send the command message to the internal database to generate corresponding data items.

[0095] In a server memory fault detection system provided in one or more optional embodiments of this specification, the memory attribute determination module, when sending the command message to the internal database, is further configured to determine whether an associated data item corresponding to the command message already exists in the internal database. The associated data item refers to a data item corresponding to the same memory unit as the command message. If the associated data item exists in the internal database, the associated data item is updated according to the command message. If the associated data item does not exist in the internal database, a new data item is generated according to the data content in the command message.

[0096] In a server memory fault detection system provided in one or more optional embodiments of this specification, the internal database includes multiple data items corresponding to multiple memory units. Each data item includes memory unit coordinates, an error flag, an error count, a physical memory address, and a logical memory address. The memory unit coordinates record the row and column information of the corresponding memory unit, indicating its specific physical location. The error flag indicates whether the corresponding memory unit has triggered a memory error that caused a system interrupt signal. The error count indicates the number of times the corresponding memory unit has triggered a memory error that caused a system interrupt signal. The physical memory address represents the actual physical address of the corresponding memory unit in the target server. The logical memory address represents the specific logical address of the corresponding memory unit in the target server system.
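
For illustration, one possible in-memory representation of such a data item is sketched below. The class name `DataItem`, the field types, and the `record_error` helper are assumptions; the specification lists the fields but not their encoding.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DataItem:
    """One data item of the internal database (field types are assumed)."""
    coordinates: Tuple[int, int]   # (row, column) of the memory unit
    physical_address: int          # actual physical address in the target server
    logical_address: int           # logical address in the target server system
    error_flag: bool = False       # has this unit caused a system interrupt signal?
    error_count: int = 0           # how many times it has done so

    def record_error(self) -> None:
        # Hypothetical helper: mark the unit as having erred and count the event.
        self.error_flag = True
        self.error_count += 1
```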

[0097] In a server memory fault detection system provided in one or more optional embodiments of this specification, the database query module is further configured to determine the information frequency of the corresponding memory unit based on the specific value of the error count item in the data item.

[0098] In a server memory fault detection system provided in one or more optional embodiments of this specification, the fault determination module is further configured to compare the information frequency with a preset error frequency threshold; when the information frequency exceeds the preset error frequency threshold, it is determined that the corresponding memory unit has a memory fault.

[0099] In one or more optional embodiments of this specification, the server memory fault detection system further includes a database maintenance module. The database maintenance module is configured to determine that a memory unit is not faulty when the information frequency corresponding to that memory unit does not exceed the preset error frequency threshold within the preset time period, and to clear the data content corresponding to the fault-free memory unit from the internal database.
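
The frequency check and the database maintenance step might be combined as in the following sketch. It assumes that the function is called once at the end of each preset time period, so that the stored error count serves as the information frequency for that period; the name `evaluate_window` and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def evaluate_window(database: Dict[str, dict], threshold: int) -> List[str]:
    """Return keys of faulty memory units and clear the fault-free ones.

    Assumes one call per preset time period, so error_count is the
    information frequency for that period.
    """
    faulty: List[str] = []
    # Iterate over a snapshot of the keys because entries are deleted in the loop.
    for key in list(database):
        if database[key]["error_count"] > threshold:
            faulty.append(key)      # frequency exceeds the threshold: memory fault
        else:
            del database[key]       # fault-free unit: clear its data content
    return faulty
```

Clearing fault-free entries at the end of each period keeps the internal database small, which matters on a resource-constrained baseboard management controller.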

[0100] In one or more optional embodiments of this specification, the server memory fault detection system further includes a memory isolation processing module. The memory isolation processing module is configured to determine the physical memory address and logical memory address of the memory unit having a memory fault, and to perform a memory isolation operation on that memory unit based on the physical memory address and the logical memory address, so that the target server ignores the memory unit.
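
How the isolation operation is carried out is not detailed here. Purely as a conceptual sketch, it can be pictured as a table of excluded regions that the target server consults before using memory. The 4 KiB page granularity, the class name `IsolationTable`, and its methods are assumptions made for this illustration and do not describe any actual firmware interface.

```python
PAGE_SIZE = 4096  # assumed isolation granularity, not taken from the specification


class IsolationTable:
    """Hypothetical record of memory pages that the target server should ignore."""

    def __init__(self) -> None:
        self._pages = set()

    def isolate(self, physical_address: int, logical_address: int) -> dict:
        # Isolate the whole page that contains the faulty memory unit.
        page = physical_address // PAGE_SIZE
        self._pages.add(page)
        return {"physical_page": page, "logical_address": logical_address}

    def is_isolated(self, physical_address: int) -> bool:
        return physical_address // PAGE_SIZE in self._pages
```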

[0101] In a server memory fault detection system provided in one or more optional embodiments of this specification, the fault determination module is further configured to determine the memory module corresponding to the faulty memory unit and the location row and column information of the memory unit in the memory module based on the memory unit attribute information; the alarm information includes the memory module, the corresponding location row and column information, and memory fault alarm prompts and corresponding prompt time information.
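
By way of example, alarm information carrying these fields could be assembled as shown below. The function name `build_alarm`, the dictionary keys, the wording of the prompt, and the ISO 8601 time format are assumptions; the specification only names the items that the alarm information contains.

```python
from datetime import datetime, timezone


def build_alarm(module: str, row: int, column: int, when: datetime) -> dict:
    """Assemble alarm information for a faulty memory unit (layout is assumed)."""
    return {
        "module": module,
        "row": row,
        "column": column,
        "prompt": f"Memory fault detected on {module} at row {row}, column {column}",
        "time": when.isoformat(),  # prompt time information
    }


alarm = build_alarm("DIMM_A1", 12, 34,
                    datetime(2023, 5, 26, 8, 0, 0, tzinfo=timezone.utc))
```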

[0102] For ease of description, the above system is described by dividing it into various modules based on their functions. Of course, when implementing one or more embodiments of this specification, the functions of the modules can be implemented in one or more pieces of software and/or hardware.

[0103] The system described above is used to implement the corresponding methods in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0104] Figure 7 illustrates a more specific hardware structure of an electronic device provided in this embodiment. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively connected to one another within the device via the bus 1050.

[0105] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.

[0106] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.

[0107] The input / output interface 1030 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.

[0108] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate via wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).

[0109] Bus 1050 includes a pathway for transmitting information between various components of the device, such as processor 1010, memory 1020, input / output interface 1030, and communication interface 1040.

[0110] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input / output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.

[0111] The electronic devices described above are used to implement the corresponding methods in the foregoing embodiments and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0112] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the server memory fault detection method as described in any of the above embodiments.

[0113] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

[0114] The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to execute the server memory fault detection method as described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0115] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), random access memory (RAM), flash memory, hard disk drive (HDD), or solid-state drive (SSD), etc.; the storage medium can also include combinations of the above types of memory.

[0116] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.

[0117] For ease of description, the above devices are described separately by function as various units. Of course, in implementing this application, the functions of the units can be implemented in one or more pieces of software and/or hardware.

[0118] Those skilled in the art will understand that embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0119] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0120] This application can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0121] The various embodiments in this specification are described in a progressive manner. For similar or identical parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are basically similar to the method embodiments, so their description is relatively brief; for relevant parts, refer to the descriptions of the method embodiments.

[0122] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this disclosure (including the claims) is limited to these examples; within the framework of this disclosure, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of one or more embodiments of this specification as described above, which are not provided in detail for the sake of brevity.

[0123] Although this disclosure has been described in conjunction with specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art from the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may be used with the embodiments discussed.

[0124] One or more embodiments of this specification are intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments of this specification should be included within the scope of protection of this disclosure.

Claims

1. A method for detecting a server memory fault, characterized in that the method is applied to a baseboard management controller and comprises: monitoring a target server and, when the target server triggers a system interrupt signal, collecting basic error information of the memory unit corresponding to the system interrupt signal; parsing the basic error information to determine corresponding memory unit attribute information, and storing the memory unit attribute information in an internal database; querying and detecting the data content in the internal database to determine the information frequency of the memory unit attribute information corresponding to multiple memory units within a preset time period; determining, based on the information frequency, whether the corresponding memory unit has a memory fault; and, when it is determined that the memory unit has a memory fault, generating corresponding alarm information according to the memory unit attribute information and reporting it; wherein the basic error information includes coordinate information and a system physical address of the corresponding memory unit; parsing the basic error information to determine the corresponding memory unit attribute information includes: querying and determining, based on the coordinate information and the corresponding local hardware configuration information of the target server, the memory module to which the memory unit belongs in the target server and its location row and column information within the memory module; and translating the system physical address to generate the corresponding memory logical address of the memory unit in the target server; the memory unit attribute information includes the memory module corresponding to the memory unit, the location row and column information within the memory module, and the corresponding memory logical address; after it is determined that the memory unit has a memory fault, the method further includes: determining the physical memory address and the logical memory address of the memory unit having the memory fault; and performing a memory isolation operation on the memory unit based on the physical memory address and the logical memory address, so that the target server ignores the memory unit; and determining, based on the information frequency, whether the corresponding memory unit has a memory fault includes: comparing the information frequency with a preset error frequency threshold; and, when the information frequency exceeds the preset error frequency threshold, determining that the corresponding memory unit has a memory fault.

2. The method according to claim 1, characterized in that monitoring the target server includes: receiving system signals from the target server through a basic input/output system; identifying the system signals to determine whether a system interrupt signal exists; and, in response to the presence of the system interrupt signal, extracting the system interrupt signal from the system signals.

3. The method according to claim 1, characterized in that collecting basic error information of the memory unit corresponding to the system interrupt signal includes: parsing the system interrupt signal to determine the coordinate information and the system physical address of the memory unit corresponding to the system interrupt signal; wherein the basic error information includes the system interrupt signal and the coordinate information and system physical address of the corresponding memory unit.

4. The method according to claim 1, characterized in that the internal database includes multiple data items corresponding to the multiple memory units; and storing the memory unit attribute information in the internal database includes: formatting the memory unit attribute information to form a command message, and sending the command message to the internal database to generate corresponding data items.

5. The method according to claim 4, characterized in that, when sending the command message to the internal database, the method further includes: determining whether an associated data item corresponding to the command message already exists in the internal database, the associated data item being a data item that corresponds to the same memory unit as the command message; in response to the existence of the associated data item in the internal database, updating the associated data item according to the command message; and, in response to the absence of the associated data item in the internal database, generating a new data item based on the data content of the command message.

6. The method according to claim 1, characterized in that the internal database includes multiple data items corresponding to the multiple memory units; each data item includes memory unit coordinates, an error flag, an error count, a physical memory address, and a logical memory address; the memory unit coordinates record the row and column information of the corresponding memory unit, the row and column information indicating the specific physical location of the memory unit; the error flag indicates whether the corresponding memory unit has had a memory error that caused a system interrupt signal to be triggered; the error count item indicates the number of times the corresponding memory unit has had a memory error that caused a system interrupt signal to be triggered; the physical memory address represents the actual physical address information of the corresponding memory unit in the target server; and the logical memory address represents the specific logical address information of the corresponding memory unit in the target server system.

7. The method according to claim 6, characterized in that determining the information frequency of the memory unit attribute information corresponding to multiple memory units within a preset time period includes: determining the information frequency of the corresponding memory unit based on the specific value of the error count item in the data item.

8. The method according to claim 1, characterized in that the method further includes: in response to the information frequency corresponding to a memory unit not exceeding the preset error frequency threshold within the preset time period, determining that the corresponding memory unit is not faulty; and clearing the data content corresponding to the fault-free memory unit from the internal database.

9. The method according to claim 1, characterized in that generating corresponding alarm information according to the memory unit attribute information includes: determining, based on the memory unit attribute information, the memory module corresponding to the faulty memory unit and the location row and column information of the memory unit within the memory module; wherein the alarm information includes the memory module, the corresponding location row and column information, a memory fault alarm prompt, and corresponding prompt time information.

10. A server memory fault detection system, characterized in that the system is applied to a baseboard management controller and includes: a system signal monitoring module, configured to monitor a target server and, when the target server triggers a system interrupt signal, collect basic error information of the memory unit corresponding to the system interrupt signal; a memory attribute determination module, configured to parse the basic error information to determine corresponding memory unit attribute information and store the memory unit attribute information in an internal database; a database query module, configured to query and detect the data content in the internal database and determine the information frequency of the memory unit attribute information corresponding to multiple memory units within a preset time period; and a fault determination module, configured to determine, based on the information frequency, whether the corresponding memory unit has a memory fault and, when it is determined that the memory unit has a memory fault, generate corresponding alarm information according to the memory unit attribute information and report it; wherein the basic error information includes coordinate information and a system physical address of the corresponding memory unit; the memory attribute determination module is further configured to query and determine, based on the coordinate information and the corresponding local hardware configuration information of the target server, the memory module to which the memory unit belongs in the target server and its location row and column information within the memory module, and to translate the system physical address to generate the corresponding memory logical address of the memory unit in the target server; the memory unit attribute information includes the memory module corresponding to the memory unit, the location row and column information within the memory module, and the corresponding memory logical address; the system further includes a memory isolation processing module, configured to determine the physical memory address and logical memory address of the memory unit having a memory fault, and to perform a memory isolation operation on the memory unit based on the physical memory address and the logical memory address, so that the target server ignores the memory unit; and the fault determination module is further configured to compare the information frequency with a preset error frequency threshold and, when the information frequency exceeds the preset error frequency threshold, determine that the corresponding memory unit has a memory fault.

11. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, the method according to any one of claims 1 to 9 is implemented.