
Troubleshooting method, device and server

A fault handling method and server technology, applied in the computer field, that addresses problems such as coarse fault location granularity, the server's inability to collect fault information and locate faults, and the lack of more detailed fault data.

Active Publication Date: 2022-05-13
HUAWEI TECH CO LTD
Cites: 16 | Cited by: 0

AI Technical Summary

Problems solved by technology

In a server that includes multiple AI chips, each AI chip appears to the host management system as a peripheral component interconnect express (PCIe) device. Fault location relies on advanced error reporting (AER) and is performed through in-band logs such as the PCIe AER code. However, locating a fault from the PCIe AER code can only narrow it down to the connection with the central processing unit (CPU). The fault location granularity for the AI chip is therefore relatively coarse, so it cannot provide more fault data to support fault repair, and when the host management system fails, the server cannot collect fault information or locate the fault.



Embodiment Construction

[0061] The technical solution in this application will be described below with reference to the accompanying drawings.

[0062] The terms "first" and "second" in the embodiments of the present application are used for description purposes only, and should not be understood as indicating or implying relative importance or as implicitly indicating the quantity of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of these features.

[0063] In the embodiments of the present application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B can each be singular or plural. The character "/" generally indicates that the contextual objects are an "or" relations...



Abstract

The present application provides a fault handling method, device, and server. The method includes the following: the baseboard management controller (BMC) receives first fault information. When the BMC determines from the first fault information that the fault will interfere with the normal operation of the server, it actively obtains second fault information corresponding to the faulty target PCIe device. The second fault information includes fault information from inside the target PCIe device and from the modules connected to the target PCIe device. The BMC then locates the fault of the server in which the target PCIe device resides according to the first fault information and the second fault information. Because the BMC further collects fault information about the target PCIe device once it determines, from the initially obtained fault information, that the fault will interfere with the normal operation of the server, the collection of fault information is more comprehensive and the fault location is more accurate. The out-of-band BMC can thus be used to collect fault information and accurately locate faults.
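The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: the `FaultInfo` record, the `"minor"`/`"critical"` severity levels, the `SOURCE_PRIORITY` ranking, and the `fake_collector` stub are all hypothetical names and policies introduced here, and a real BMC would gather the second fault information over an out-of-band channel that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class FaultInfo:
    device_id: str      # hypothetical identifier of the PCIe device
    source: str         # hypothetical origin: "aer", "internal", "connected_module"
    severity: str       # hypothetical levels: "minor" or "critical"
    detail: str = ""


# Assumed ranking: records from inside the device are treated as the most
# specific evidence, the initial AER-style report as the least specific.
SOURCE_PRIORITY = {"internal": 0, "connected_module": 1, "aer": 2}


def interferes_with_operation(info: FaultInfo) -> bool:
    """Assumed policy: only critical faults disturb normal server operation."""
    return info.severity == "critical"


def handle_fault(
    first: FaultInfo,
    collect_second: Callable[[str], List[FaultInfo]],
) -> Optional[FaultInfo]:
    """Return the most specific fault record, or None if no action is needed."""
    if not interferes_with_operation(first):
        return None
    # Stage 2: actively fetch fault data from the device and its attached modules.
    second = collect_second(first.device_id)
    evidence = [first, *second]
    # Locate the fault using both the first and the second fault information.
    return min(
        evidence,
        key=lambda f: SOURCE_PRIORITY.get(f.source, len(SOURCE_PRIORITY)),
    )


def fake_collector(device_id: str) -> List[FaultInfo]:
    """Stand-in for the out-of-band collection step (purely illustrative)."""
    return [
        FaultInfo(device_id, "connected_module", "critical", "link error on attached module"),
        FaultInfo(device_id, "internal", "critical", "on-chip memory ECC error"),
    ]


located = handle_fault(FaultInfo("npu0", "aer", "critical", "uncorrectable error"), fake_collector)
print(located.source, located.detail)
```

The point of the sketch is the ordering: the second, more detailed collection runs only after the first fault information indicates that the fault will interfere with normal operation, and the final location step combines both sets of information.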

Description

Technical field

[0001] The present application relates to the field of computer technology, and in particular to a fault handling method, device and server.

Background technique

[0002] In computing for the field of artificial intelligence (AI), the demand for server computing power continues to increase, and at the same time the requirements for the reliability, availability and serviceability (RAS) of servers are also getting higher and higher.

[0003] In order to provide sufficient computing power, servers that integrate multiple AI chips (such as graphics processing units (GPU), neural processing units (NPU), and tensor processing units (TPU)) came into being. Interconnecting multiple AI chips to form a multi-P system provides stronger computing power for AI computing. In a server including multiple AI chips, from the perspective of the host management system, the AI ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F11/07, G06F11/22
CPC: G06F11/221, G06F11/0775
Inventor: 李钟, 宋刚
Owner: HUAWEI TECH CO LTD