Fault location method and server

This fault location method and server, applied in the computer field, address problems such as the inability to locate faults quickly, system hangs, and the difficulty of obtaining MCA fault records.

Active Publication Date: 2016-05-18
XFUSION DIGITAL TECH CO LTD

AI Technical Summary

Problems solved by technology

[0002] A CATERR (Catastrophic Error) or IERR (Internal Error) may occur while an x86 server is running, causing the system to hang and abruptly interrupting the services running on it. After a CATERR-induced hang, it is difficult to obtain a complete MCA (Machine Check Architecture) fault record. Moreover, even when the MCA fault record is collected, the fault cannot be located quickly and accurately from the large volume of MCA register data.
[0003] At present, fault location for CATERR or IERR relies mainly on manual judgment, on running a diagnostic program, or on replacing devices to identify the faulty one, none of which locates the fault quickly. In short, existing techniques locate the faults behind CATERR- or IERR-induced hangs inefficiently, which seriously affects the user experience.

Examples


Embodiment 1

[0062] An embodiment of the present invention provides a fault location method. As shown in Figure 2, the method includes the following steps:

[0063] 201. Acquire error data, and determine the timeout-type error in the error data that corresponds to the hang-type fault.

[0064] Here, the hang-type fault is a 3-Strike fault. The error data records the errors generated when the server fails, and the timeout-type error is an error generated when the server fails. The method in this embodiment may be executed by the BMC in the server on which the 3-Strike fault occurs, namely the BMC module 103 shown in Figure 1.

[0065] Before this step, the server first detects a fault notification message. The fault notification message indicates that a hang-type fault has occurred, and may be a CATERR_N signal. For example, when the hang-type fault occurs, the MCA module of the server sends a CATERR_N signal to the BMC (that is, the pin communicati...
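Step 201 can be illustrated with a minimal Python sketch. The excerpt does not say how the error data is encoded or how a timeout-type error is recognized, so everything here is an assumption made for illustration: the `BankRecord` type, the function names, and the choice to treat each record as a raw MCA bank status/address pair. The sketch assumes the status register layout in which bit 63 (VAL) marks a valid error and bit 58 (ADDRV) marks a valid address register, and it assumes that a timeout shows up as the T bit of a bus/interconnect error code of the form 0000 1PPT RRRR IILL.

```python
from dataclasses import dataclass
from typing import List

VAL_BIT = 1 << 63      # status register holds a valid error record
ADDRV_BIT = 1 << 58    # the bank's address register holds a valid address
TIMEOUT_BIT = 1 << 8   # "T" in the assumed 0000 1PPT RRRR IILL error code


@dataclass
class BankRecord:
    """Hypothetical container for one MCA bank captured in the error data."""
    bank: int
    status: int  # raw status register value
    addr: int    # raw address register value


def is_timeout_error(status: int) -> bool:
    """Return True if the status value looks like a valid timeout-type error.

    Assumption: only bus/interconnect error codes with the T bit set count
    as timeout-type errors. The patent text may use a different criterion.
    """
    code = status & 0xFFFF
    is_bus_interconnect = (code & 0xF800) == 0x0800
    return bool(status & VAL_BIT) and is_bus_interconnect and bool(code & TIMEOUT_BIT)


def timeout_error_addresses(records: List[BankRecord]) -> List[int]:
    """Collect the error addresses of timeout-type errors with a valid address."""
    return [
        r.addr
        for r in records
        if is_timeout_error(r.status) and (r.status & ADDRV_BIT)
    ]
```

Records without a valid address are skipped because the later steps depend on reading the error address from the address register.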

Embodiment 2

[0154] An embodiment of the present invention provides a server 30. As shown in Figure 3, the server 30 includes an acquiring unit 301, a reading unit 302, a matching unit 303, and a fault location unit 304.

[0155] The acquiring unit 301 is configured to acquire error data, and to determine the timeout-type error in the error data that corresponds to the hang-type fault. The error data records the errors generated when the server fails, and the timeout-type error is an error generated when the server fails.

[0156] The reading unit 302 is configured to read the error address in the address register of the timeout-type error determined by the acquiring unit 301.

[0157] The matching unit 303 is configured to match the error address against a pre-stored PCIe (Peripheral Component Interconnect Express) device address space table. The PCIe device address space table records the correspondence between each PCIe device and its address space.

[0158] The fault locating unit 304 is con...
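The matching performed by the matching unit 303, together with the decision described in the abstract (a PCIe device whose address space contains the error address is taken as the fault source), can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `AddressRange` type, the `locate_fault_source` function, the base/size representation of an address space, and the device labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AddressRange:
    """Hypothetical row of the pre-stored PCIe device address space table."""
    device: str  # label for the PCIe device, e.g. a bus:device.function string
    base: int    # first address of the device's address space
    size: int    # length of the address space in bytes

    def contains(self, address: int) -> bool:
        # Half-open interval: [base, base + size)
        return self.base <= address < self.base + self.size


def locate_fault_source(error_address: int,
                        table: List[AddressRange]) -> Optional[str]:
    """Return the device whose address space contains the error address.

    Returns None when no entry matches, in which case this sketch draws
    no conclusion about the fault source.
    """
    for entry in table:
        if entry.contains(error_address):
            return entry.device
    return None
```

A linear scan is used for clarity. If the table were large and sorted by base address, a binary search would be a natural alternative.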

Embodiment 3

[0173] An embodiment of the present invention provides a server 40. As shown in Figure 4, the server 40 includes a processor 401, a system bus 402, and a memory 403.

[0174] The processor 401 may be a central processing unit (CPU).

[0175] The memory 403 is used to store program code and to transmit the program code to the processor 401, which executes the following instructions according to the program code. The memory 403 may include volatile memory, such as random-access memory (RAM); the memory 403 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). The memory 403 m...


Abstract

The invention provides a fault location method and a server, relates to the technical field of computers, and is used for locating faults quickly and accurately when CATERR (Catastrophic Error) or IERR (Internal Error) type faults occur. The fault location method comprises the following steps: detecting a fault notification message, which indicates that a hang-type fault has occurred; acquiring error data, and determining the timeout-type error in the error data that corresponds to the hang-type fault; reading the error address in the address register of the timeout-type error; matching the error address against the address spaces pre-stored in a PCIe (Peripheral Component Interconnect Express) device address space table; and, if the table contains an address space that matches the error address, determining that the PCIe device corresponding to that address space is the fault source that caused the hang-type fault.

Description

Technical field

[0001] The invention relates to the field of computer technology, and in particular to a fault location method and a server.

Background

[0002] A CATERR (Catastrophic Error) or IERR (Internal Error) may occur while an x86 server is running, causing the system to hang and abruptly interrupting the services running on it. After a CATERR-induced hang, it is difficult to obtain a complete MCA (Machine Check Architecture) fault record. Moreover, even when the MCA fault record is collected, the fault cannot be located quickly and accurately from the large volume of MCA register data.

[0003] At present, fault location for CATERR or IERR relies mainly on judgment from manual experience, on running a diagnostic program, or on replacing devices to identify the faulty one, none of which locates the fault quickly. In short, in the...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F11/22
CPC: G06F11/2273
Inventor: 宋刚
Owner: XFUSION DIGITAL TECH CO LTD