Active Fault Tolerance Method for Supercomputer Node Fault Based on Online Learning

An active fault-tolerance technique for supercomputer node failures. It targets the high fault-tolerance overhead of existing methods.

Active Publication Date: 2018-02-16

NAT UNIV OF DEFENSE TECH

4 Cites, 0 Cited by


Problems solved by technology

[0015] The technical problem addressed by the present invention is the large fault-tolerance overhead of the system-level checkpoint method. To overcome this defect, the invention proposes an active fault-tolerance method for supercomputer node faults based on online learning.


Embodiment Construction

[0072] Figure 1 is a schematic diagram of how each computing node executes under the traditional system-level checkpoint recovery method (Du Yunfei, Research and Analysis of Fault-Tolerant Parallel Algorithms, National University of Defense Technology, doctoral thesis, 2008, pages 7-12, 30-32). T_c is the time required to perform a system-level checkpoint, and T_rc is the recovery time after a failure. The unshaded part of the figure is the time the system spends executing computing tasks. The cross marks the time at which a fault occurs, and the triangle marks the position from which the program resumes after the fault is recovered. Taking the petascale supercomputer system "Tianhe-1" as an example, the mean time between failures (MTBF) of the system is several hours, the time T_c required to perform a system-level checkpoint is more than ten minutes, and the time T_rc required to perform fault recovery is ... than T...
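To give a feel for the scale of this overhead, the sketch below applies the standard first-order checkpoint model (Young's approximation for the checkpoint interval). This model is not taken from the patent text. The function names are hypothetical, and the numbers in the usage note (a 4-hour MTBF and 10 minutes each for T_c and T_rc) are illustrative assumptions chosen only to be consistent with "several hours" and "more than ten minutes"; the source gives no exact figures, and the T_rc value in particular is a placeholder.

```python
import math


def young_interval(t_c: float, mtbf: float) -> float:
    """Young's first-order estimate of the checkpoint interval: sqrt(2 * T_c * MTBF)."""
    return math.sqrt(2.0 * t_c * mtbf)


def wasted_fraction(interval: float, t_c: float, t_rc: float, mtbf: float) -> float:
    """First-order estimate of the fraction of wall-clock time not spent on useful work.

    Assumes interval << mtbf. All arguments share one time unit (e.g. minutes).
    """
    # Cost of writing one checkpoint per interval of useful work.
    checkpoint_cost = t_c / interval
    # On each failure, about half an interval of work is redone, plus recovery time.
    failure_cost = (interval / 2.0 + t_rc) / mtbf
    return checkpoint_cost + failure_cost
```

Calling `wasted_fraction(young_interval(10, 240), 10, 10, 240)` with these assumed values (in minutes) yields roughly one third, which illustrates why checkpoint overhead is a concern when a checkpoint takes minutes and failures occur every few hours.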


Abstract

The invention discloses an active fault-tolerance method for supercomputer nodes based on online learning, which aims to overcome the large fault-tolerance overhead of the system-level checkpoint method. The technical scheme comprises the following steps: in the constructed supercomputer system, a service node collects the historical state data of newly failed nodes, uses the data to perform centralized online learning of node fault behaviour, and obtains an updated fault predictor; each computing node collects its own state data and uses the new fault predictor to predict whether it is about to fail, and if so, the application processes running on that node are migrated; the service node and the computing nodes then sleep for a specified active fault-tolerance time interval delta before starting a new round of the active fault-tolerance process. With this method, supercomputer node faults can be predicted in advance and handled by low-overhead active fault tolerance, which addresses the large fault-tolerance overhead of the system-level checkpoint method and improves the availability of the supercomputer system.
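The round structure described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the abstract does not name a learning algorithm, so a plain online logistic regression stands in for the fault predictor, and every name here (`OnlineFaultPredictor`, `fault_tolerance_round`, `run_active_fault_tolerance`, the `migrate` callback, the feature vectors) is a hypothetical placeholder.

```python
import math
import time


class OnlineFaultPredictor:
    """Stand-in predictor: online logistic regression (an assumption, not the patent's model)."""

    def __init__(self, n_features, lr=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.lr = lr

    def _score(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, failed):
        """One stochastic-gradient step on a single (state, failed?) sample."""
        err = self._score(x) - (1.0 if failed else 0.0)
        self.weights = [w - self.lr * err * xi for w, xi in zip(self.weights, x)]
        self.bias -= self.lr * err

    def will_fail(self, x, threshold=0.5):
        return self._score(x) >= threshold


def fault_tolerance_round(predictor, fault_history, node_states, migrate):
    """One round: service node learns, then each computing node predicts and migrates."""
    # Service node: centralized online learning from newly collected history.
    for features, failed in fault_history:
        predictor.update(features, failed)
    # Computing nodes: predict from current state, migrate processes if failure is expected.
    migrated = []
    for node_id, features in node_states.items():
        if predictor.will_fail(features):
            migrate(node_id)
            migrated.append(node_id)
    return migrated


def run_active_fault_tolerance(predictor, collect_history, collect_states,
                               migrate, delta_seconds, rounds):
    """Repeat rounds, sleeping for the interval delta after each one."""
    log = []
    for _ in range(rounds):
        log.append(fault_tolerance_round(
            predictor, collect_history(), collect_states(), migrate))
        time.sleep(delta_seconds)
    return log
```

In this sketch the learning and prediction steps run in one process for brevity; in the scheme the abstract describes, learning is centralized on the service node and prediction runs on each computing node with the updated predictor.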

Description

Technical field

[0001] The invention relates mainly to fault-tolerance methods for supercomputer systems, and in particular to implementing low-overhead active fault tolerance for supercomputer nodes using online machine learning.

Background

[0002] A supercomputer system can greatly reduce the execution time of large-scale computing tasks by combining numerous computing components that perform the same task in parallel. A typical supercomputer system is composed of one or more service nodes for login and management, and numerous computing nodes that carry out the computing tasks. The service nodes and computing nodes communicate with each other through the monitoring and management network, which is used for the maintenance and management of the supercomputer system. A monitoring system is deployed on the service node, which can monitor the operation of each computing node; and a resource management system i...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F11/30, G06F11/14

CPC: G06F11/1425, G06F11/3006, G06F11/3055

Inventor: 蒋艳凰, 卢宇彤, 赵强利, 周恩强, 董勇, 胡维, 孙勤

Owner: NAT UNIV OF DEFENSE TECH
