GPU fault diagnosis system, diagnosis method, equipment and readable storage medium

A GPU fault diagnosis system and diagnosis method, applied in the field of GPU fault diagnosis, that addresses low fault diagnosis accuracy, difficult fault location, and incomplete log collection. The effect is to improve the efficiency and accuracy of fault diagnosis and to reduce the technical requirements placed on engineers.

Pending Publication Date: 2021-12-10
INSPUR SUZHOU INTELLIGENT TECH CO LTD
11 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0006] When GPU faults are judged from server out-of-band logs, diagnosis accuracy is low because current out-of-band logs cannot effectively monitor the GPU running status.
[0007] When GPU faults are judged from in-band logs provided by customers, the logs are often incompletely collected because customers differ in technical skill, so GPU faults cannot be located accurately.
[0008] When GPU faults are judged from customer repair descriptions, different customers describe GPU faults in different ways and the descriptions are imprecise, which makes fault location difficult.
[0009] In addition, most customers do not allow logging in to the OS for troubleshooting and do not provide in-band logs.
The technical level of on-site engineers is uneven, and GPU fault diagnosis methods and tools are complicated to use.

Examples

Embodiment Construction

[0044] The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0045] This embodiment provides a GPU fault diagnosis system for x86 servers (referred to in this embodiment as AI EasyCfg, short for Artificial Intelligence Easy Configure). It is suitable for NVIDIA GPU state detection and functional testing, and it can improve the work efficiency of on-site engineers and the accuracy of GPU fault judgment. Its functions include user-friendly interaction, one-click log collection, fault log diagnosis, GPU real-time status detection and stress testing, and fault handling suggestions.
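The GPU real-time status detection function described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: it assumes the `nvidia-smi` command-line tool is installed, and the `GpuStatus` type, the function names, and the warning thresholds are all hypothetical.

```python
import csv
import io
import subprocess
from dataclasses import dataclass

# Fields requested from nvidia-smi's --query-gpu interface.
QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuStatus:
    """One row of GPU state (hypothetical type, for illustration only)."""
    index: int
    name: str
    temperature_c: int
    utilization_pct: int
    memory_used_mib: int
    memory_total_mib: int


def parse_gpu_status(csv_text):
    """Parse 'csv,noheader,nounits' output from nvidia-smi into GpuStatus rows.

    Assumes every field is numeric; real output may contain values such as
    '[N/A]' on some GPUs, which this sketch does not handle.
    """
    statuses = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        f = [field.strip() for field in row]
        statuses.append(
            GpuStatus(int(f[0]), f[1], int(f[2]), int(f[3]), int(f[4]), int(f[5]))
        )
    return statuses


def find_warnings(statuses, max_temp_c=85, max_mem_fraction=0.95):
    """Flag GPUs above the (assumed, illustrative) temperature or memory limits."""
    warnings = []
    for s in statuses:
        if s.temperature_c > max_temp_c:
            warnings.append(
                f"GPU {s.index}: temperature {s.temperature_c} C exceeds {max_temp_c} C"
            )
        if s.memory_total_mib and s.memory_used_mib / s.memory_total_mib > max_mem_fraction:
            warnings.append(
                f"GPU {s.index}: memory use {s.memory_used_mib}/{s.memory_total_mib} MiB is near full"
            )
    return warnings


def query_gpus():
    """Run nvidia-smi once and return the parsed status of every GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_status(result.stdout)
```

On a machine with an NVIDIA driver installed, `find_warnings(query_gpus())` would return a list of warning strings. Keeping the parser separate from the subprocess call lets the logic be exercised on saved output, which matches the workflow of feeding collected logs back for later analysis.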

[0046] In this embodiment, as shown in Figure 1, the GPU fault diagnosis system is built on a Linux OS. The computer system adopts CUDA (Compute Unified Device Architecture), a computing platform launched by graphics card manufacturer NVIDIA. It is a general-purpose parallel computing architecture that enables the GPU to solve...

Abstract

The invention discloses a GPU fault diagnosis system, method, and device, and a readable storage medium. The system comprises: a one-click log collection module for collecting system in-band logs, GPU fault logs, and GPU running-state index files in a single operation; a fault log inspection module for inspecting GPU logs, outputting fault information, and giving handling suggestions; a GPU real-time state detection module for detecting the real-time running state of the GPU with one click, automatically discovering faults, and giving handling suggestions; a GPU stress test module for diagnosing hard-to-reproduce GPU faults; a one-click GPU driver replacement module for changing the GPU driver version; a log module for outputting and storing logs; and a GPU driver module for ensuring that the GPU operates. The system thus provides one-click log collection, fault log inspection, GPU real-time state detection, GPU stress testing, one-click GPU driver replacement, and handling suggestions, so that engineers can conveniently locate faults on site and feed the collected logs back to the back end for processing.
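As a rough illustration of what a fault log inspection module of this kind might do, the sketch below scans kernel log lines for NVIDIA driver Xid events and maps them to fault information and a handling suggestion. The abstract does not state which log patterns or rules the invention uses, so matching on Xid messages is an assumption made here, and the rule table, the suggestion wording, and the names `RULES`, `Finding`, and `inspect_log` are hypothetical.

```python
import re
from typing import NamedTuple

# NVIDIA driver Xid events appear in the kernel log in the form
# "NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ...".
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")

# Hypothetical rule table: Xid code -> (fault description, handling suggestion).
# The suggestions are illustrative placeholders, not vendor guidance.
RULES = {
    48: ("Double-bit ECC error",
         "Reset the GPU and rerun diagnostics; consider replacement if it recurs."),
    79: ("GPU has fallen off the bus",
         "Check seating, power, and cabling, then reboot and retest."),
}

DEFAULT_RULE = ("Unrecognized Xid event",
                "Collect full logs and escalate to back-end support.")


class Finding(NamedTuple):
    device: str
    xid: int
    fault: str
    suggestion: str


def inspect_log(lines):
    """Return one Finding per log line that contains an Xid event."""
    findings = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match is None:
            continue
        code = int(match.group("code"))
        fault, suggestion = RULES.get(code, DEFAULT_RULE)
        findings.append(Finding(match.group("device"), code, fault, suggestion))
    return findings
```

A table-driven design keeps the diagnostic knowledge in data rather than in code, which fits the stated goal of lowering the skill required of on-site engineers: new fault signatures can be added without changing the inspection logic.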

Description

Technical field

[0001] The invention relates to the technical field of GPU fault diagnosis, in particular to a GPU fault diagnosis system, a diagnosis method, a device, and a readable storage medium.

Background

[0002] At present, fields related to artificial intelligence are developing rapidly, and the AI server market has grown sharply. Rapid fault diagnosis of GPUs (Graphics Processing Units) has therefore become an important part of server after-sales service. Current GPU fault diagnosis has the following problems:

[0003] On-site operation and maintenance personnel and third-party engineers differ in technical level, and long training and practice are needed before they are competent at GPU fault diagnosis. As a result, GPU fault diagnosis takes a long time and fault judgment accuracy is low.

[0004] Customers do not allow engineers to log in to the OS for GPU troubleshooting, and it is extremely ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G01R31/30
CPC: G01R31/30, Y02D10/00
Inventors: 张健, 陈彬, 刘海洲
Owner: INSPUR SUZHOU INTELLIGENT TECH CO LTD