Online learning-based super computer node active fault-tolerant method

A technology for handling supercomputer node failures, applied in fields such as redundancy in computation and operations for data error detection and error response generation. It can solve problems such as high fault-tolerance overhead.

Active Publication Date: 2016-06-29
NAT UNIV OF DEFENSE TECH

AI Technical Summary

Problems solved by technology

[0015] The technical problem to be solved by the present invention is the large fault-tolerance overhead of the system-level checkpoint method. To address this defect, the invention proposes an active fault-tolerant method for supercomputer node faults based on online learning.


Examples

Embodiment Construction

[0072] Figure 1 is a schematic diagram of the execution effect on each computing node under the traditional system-level checkpoint recovery method (Du Yunfei, "Research and Analysis of Fault-Tolerant Parallel Algorithms", doctoral thesis, National University of Defense Technology, 2008, pages 7-12, 30-32). T_c is the time required to perform a system-level checkpoint, and T_rc is the time required to recover from a failure. The unshaded part of the figure is the time the system spends executing computing tasks. The cross marks the point in time at which a fault occurs, and the triangle marks the position from which the program continues to execute after the fault is recovered. Taking the petascale supercomputer system "Tianhe-1" as an example, the mean time between failures (MTBF) of the system is several hours, the time required to perform a system-level checkpoint, T_c, has reached more than ten minutes, and the time required to perform fault recovery, T_rc, than T...


Abstract

The invention discloses an active fault-tolerant method for supercomputer nodes based on online learning, which aims to overcome the large fault-tolerance overhead of the system-level checkpoint method. The technical scheme comprises the following steps: for a constructed supercomputer system, the service node collects the historical state data of newly failed nodes, uses the data to perform centralized online learning of node failure behaviour, and obtains an updated fault predictor; each computing node acquires its own state data and uses the new fault predictor to predict whether it is about to fail, and if so, the application processes running on that node are migrated; the service node and the computing nodes then sleep for a specified active fault-tolerance time interval delta before starting a new round of active fault tolerance. With this method, supercomputer node faults can be predicted in advance and low-overhead active fault tolerance can be implemented, which solves the problem of large fault-tolerance overhead in the system-level checkpoint method and improves the availability of the supercomputer system.
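The round structure described in the abstract (update the predictor, predict per node, migrate, sleep for delta) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the excerpt does not say which learner or which state features are used, so the logistic-regression learner, the feature vectors, the 0.5 threshold, and every name here (`OnlineFaultPredictor`, `active_fault_tolerance_round`, `run_active_fault_tolerance`, the `migrate` callback) are assumptions made for illustration.

```python
import math
import time
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

# Hypothetical sample format: (node state features, 1 if the node failed, else 0).
Sample = Tuple[Sequence[float], int]


class OnlineFaultPredictor:
    """Stand-in online learner: logistic regression updated one sample at a time."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def probability(self, state: Sequence[float]) -> float:
        """Estimated probability that a node in this state is about to fail."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, state))
        z = max(-60.0, min(60.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, samples: Iterable[Sample]) -> None:
        """Incrementally fold newly collected history into the model (SGD step per sample)."""
        for state, failed in samples:
            error = self.probability(state) - failed
            self.weights = [
                w - self.learning_rate * error * x
                for w, x in zip(self.weights, state)
            ]
            self.bias -= self.learning_rate * error

    def will_fail(self, state: Sequence[float], threshold: float = 0.5) -> bool:
        return self.probability(state) >= threshold


def active_fault_tolerance_round(
    predictor: OnlineFaultPredictor,
    new_history: Iterable[Sample],
    node_states: Dict[str, Sequence[float]],
    migrate: Callable[[str], None],
) -> List[str]:
    """One round: service node updates the predictor, then each node is checked."""
    predictor.update(new_history)  # centralized online learning on the service node
    migrated = []
    for node, state in node_states.items():
        if predictor.will_fail(state):  # prediction with the updated predictor
            migrate(node)  # move application processes off the suspect node
            migrated.append(node)
    return migrated


def run_active_fault_tolerance(
    predictor: OnlineFaultPredictor,
    collect_history: Callable[[], Iterable[Sample]],
    collect_states: Callable[[], Dict[str, Sequence[float]]],
    migrate: Callable[[str], None],
    delta_seconds: float,
    rounds: int,
) -> None:
    """Repeat rounds, sleeping for the active fault-tolerance interval delta in between."""
    for _ in range(rounds):
        active_fault_tolerance_round(
            predictor, collect_history(), collect_states(), migrate
        )
        time.sleep(delta_seconds)
```

In this sketch the learner is updated with only the newly collected samples in each round, which is what makes it "online": the model is never retrained from scratch. A real deployment would replace the callbacks with the monitoring system's data collection and the resource manager's process migration.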

Description

Technical field

[0001] The invention mainly relates to fault-tolerant methods for supercomputer systems, and in particular to how to implement low-overhead active fault tolerance for supercomputer nodes using online machine learning technology.

Background art

[0002] A supercomputer system can greatly reduce the execution time of large-scale computing tasks by combining numerous computing components to perform the same computing task in parallel. A typical supercomputer system is composed of one or more service nodes for login and management, and numerous computing nodes for completing computing tasks. The service nodes and computing nodes communicate with each other through the monitoring and management network, which is used for the maintenance and management of the supercomputer system. A monitoring system is deployed on the service node, which can monitor the operation of each computing node; and a resource management system i...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F11/30, G06F11/14
CPC: G06F11/1425, G06F11/3006, G06F11/3055
Inventor: 蒋艳凰, 卢宇彤, 赵强利, 周恩强, 董勇, 胡维, 孙勤
Owner: NAT UNIV OF DEFENSE TECH