Method and device for server self-healing

The invention relates to server technology and the handling of memory exception information. It addresses problems such as hidden faults remaining in the system, limited improvement in system reliability, and business interruption. It reduces the need for manual on-site intervention and restores the server to a normal working state.

Active Publication Date: 2020-09-04
NANJING ZHONGXING XIN SOFTWARE CO LTD

AI Technical Summary

Problems solved by technology

The reliability of a server's memory directly affects the stability and reliability of the board. Memory problems lead directly to business interruption and, in serious cases, downtime.
Although most high-performance, high-reliability servers use memory with ECC (Error Checking and Correcting) functions, the resulting improvement in system reliability is limited.
There are two main reasons. First, memory with the ECC function can automatically correct a correctable ECC error, but if such errors occur frequently, the memory has a serious hidden fault. Automatic correction is therefore a relatively passive approach, because the underlying fault in the system has not been eliminated. Second, after an uncorrectable ECC error or other unrecoverable error occurs, the system suffers serious consequences such as a blue screen or downtime. If out-of-band management is not involved in handling such failures, only on-site personnel can shut down the server and replace the memory.

Examples

Embodiment 1

[0055] As shown in Figure 1, the server management system structure includes an SMM and several slave nodes, that is, the BMC on each server; each server board also has a BIOS. The SMM is connected to the BMC of each server through IPMB (Intelligent Platform Management Bus), LAN (Local Area Network), or other methods. The BMC and BIOS can communicate through various types of physical channels. This system structure provides a physical channel for the SMM to manage server memory exceptions. Servers in the system mostly use memory that supports the ECC function, which provides the hardware prerequisite for timely detection of memory abnormalities. The main function of B / C is to configure how the BMC handles memory exceptions, for example by configuring a policy such as restarting the board and isolating the fault when the frequency of a recoverable memory fault on a certain memory module is greater than a certain threshol...
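The threshold policy mentioned in [0055] can be illustrated with a minimal Python sketch. It assumes that "frequency" means the number of recoverable errors on one memory module within a sliding time window; the source does not say how frequency is measured. The names `MemoryFaultPolicy` and `CorrectableErrorMonitor`, the field names, and the default values are hypothetical and do not come from the patent.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryFaultPolicy:
    """Hypothetical policy; field names and defaults are illustrative only."""
    max_correctable_errors: int = 3   # threshold within one window
    window_seconds: float = 60.0      # assumed sliding-window length


class CorrectableErrorMonitor:
    """Tracks recoverable-error timestamps separately for each memory module."""

    def __init__(self, policy):
        self.policy = policy
        self._events = defaultdict(deque)

    def record(self, module_id, timestamp):
        """Record one recoverable error.

        Returns True when the error count inside the window is greater than
        the threshold, i.e. when the policy action (restart the board and
        isolate the module) would be triggered.
        """
        events = self._events[module_id]
        events.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.policy.window_seconds:
            events.popleft()
        return len(events) > self.policy.max_correctable_errors
```

Counting per module rather than per board means that one noisy module cannot mask, or be masked by, errors on its neighbours.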

Embodiment 2

[0100] Figure 4 is a flow chart of the server self-healing method in Embodiment 2 of the present invention, in which:

[0101] The BIOS is responsible for detecting memory abnormalities. It can distinguish between recoverable one-bit ECC errors and unrecoverable two-bit ECC errors, and can locate the fault to a specific physical memory module. When the system starts up again after self-healing, the BIOS can quarantine the abnormal memory module so that it is no longer used.

[0102] The BMC is responsible for forwarding the memory exceptions reported by the BIOS to the SMM, and for reporting the faulty memory module information to the BIOS (basic input/output system) when the server is powered on again.

[0103] The SMM receives the memory fault information forwarded by the out-of-band management module (BMC), counts the exceptions separately for each memory module, and decides whether to perform self-healing processing on the specified abnormal board according to the seriou...
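The division of roles in [0101] to [0103] can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not the patented implementation: the class and function names are hypothetical, the rule that an unrecoverable error triggers self-healing immediately is an assumption, and the recoverable-error limit is an arbitrary example value.

```python
from collections import Counter
from enum import Enum


class EccError(Enum):
    CORRECTABLE = "one-bit"      # recoverable
    UNCORRECTABLE = "two-bit"    # unrecoverable


class Smm:
    """Counts exceptions per (board, module) and decides on self-healing."""

    def __init__(self, correctable_limit=3):
        self.correctable_limit = correctable_limit  # assumed example threshold
        self.counts = Counter()

    def on_exception(self, board, module, error):
        """Return True if the board should be self-healed for this module."""
        self.counts[(board, module)] += 1
        if error is EccError.UNCORRECTABLE:
            return True  # assumption: unrecoverable errors heal immediately
        return self.counts[(board, module)] > self.correctable_limit


class Bmc:
    """Forwards BIOS reports to the SMM and keeps the quarantine list."""

    def __init__(self, board, smm):
        self.board = board
        self.smm = smm
        self.quarantined = set()
        self.restart_requested = False

    def on_bios_report(self, module, error):
        if self.smm.on_exception(self.board, module, error):
            self.quarantined.add(module)
            self.restart_requested = True

    def quarantine_info_for_bios(self):
        """Faulty-module information handed to the BIOS at the next power-on."""
        return sorted(self.quarantined)


def bios_usable_modules(installed, quarantine_info):
    """On boot, the BIOS skips every module named in the quarantine info."""
    return [m for m in installed if m not in quarantine_info]
```

In this sketch the BMC holds the quarantine list across the restart, which matches its role in [0102] of reporting faulty module information to the BIOS when the server powers on again.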

Abstract

Provided is a server self-healing method, said method comprising: a baseboard management controller (BMC) receives exception information sent by a basic input / output system (BIOS), said exception information comprising memory exception type and exception memory module identifier; according to said exception information, the BMC or system management module (SMM) generates quarantine memory information, and processes the board accordingly; the BMC sends the quarantine memory information to the BIOS, said quarantine memory information being used for instructing the BIOS to quarantine the corresponding exception memory. By means of the coordination of the BMC, the BIOS, and the SMM, the described solution accomplishes automatic self-healing of a server, which reduces the possibility of on-site manual intervention and operation and restores the server to a state of normal operation as quickly as possible.

Description

Technical field

[0001] The invention relates to the server field, and in particular to a server self-healing method and device.

Background

[0002] At present, operators face huge challenges. They must be able to integrate network resources quickly to provide users with the latest services, while also reducing network procurement costs, operation and maintenance costs, and fault recovery time. Operators own a large number of servers fitted with large amounts of memory. Memory failures cause server abnormalities, which reduce the stability of the services operators provide and increase fault recovery time and maintenance costs.

[0003] In the server, the BMC (Baseboard Management Controller, the out-of-band management module) monitors the working status of the server, manages its power-on and power-off, and handles abnormalities promptly and issues an alarm when they occur. BMC exists as an independent firmwa...

Application Information

IPC(8): G06F11/22, H04L12/24
CPC: G06F11/07
Inventor 李军
Owner NANJING ZHONGXING XIN SOFTWARE CO LTD