Active Fault Tolerance Method for Supercomputer Node Fault Based on Online Learning
An active fault-tolerance technique for supercomputer node failures. It applies to redundancy-based data error detection and to responding to errors as they occur, and it targets problems such as high fault-tolerance overhead.
Embodiment Construction
[0072] Figure 1 is a schematic diagram of execution on each computing node under the traditional system-level checkpoint recovery method (Du Yunfei, Research and Analysis of Fault-Tolerant Parallel Algorithms, doctoral thesis, National University of Defense Technology, 2008, pp. 7-12, 30-32). T_c is the time required to perform a system-level checkpoint, and T_rc is the time required to recover from a failure. The unshaded part of the figure is the time the system spends executing computing tasks. The cross marks the point in time at which a fault occurs, and the triangle marks the position from which the program resumes execution after the fault is recovered. Taking the petascale supercomputer system "Tianhe-1" as an example, the mean time between failures (MTBF) of the system is several hours, the time T_c required to perform a system-level checkpoint has reached more than ten minutes, and the time T_rc required to perform fault recovery ... than T...
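To give a rough sense of why checkpoint cost matters at this scale, the sketch below estimates the fraction of wall-clock time lost to checkpointing and recovery. It is not taken from the patent. It uses a common first-order model and Young's approximation for the checkpoint interval, and the function names and every numeric value (an MTBF of 4 hours, T_c of 10 minutes, T_rc of 10 minutes) are illustrative assumptions chosen only to match the orders of magnitude mentioned in the text.

```python
import math


def young_interval(t_c: float, mtbf: float) -> float:
    """Young's first-order approximation of the compute time between checkpoints.

    This formula is an assumption brought in for illustration;
    the source text does not state how the interval is chosen.
    """
    return math.sqrt(2.0 * t_c * mtbf)


def expected_overhead(interval: float, t_c: float, t_rc: float, mtbf: float) -> float:
    """Rough fraction of wall-clock time not spent on useful computation.

    Simplified model (assumption): faults arrive on average once per MTBF,
    and each fault costs the recovery time plus, on average, half of one
    compute-plus-checkpoint segment of lost work.
    """
    segment = interval + t_c
    checkpoint_cost = t_c / segment                 # time spent writing checkpoints
    failure_cost = (t_rc + segment / 2.0) / mtbf    # recovery plus re-executed work
    return checkpoint_cost + failure_cost


if __name__ == "__main__":
    # All values in minutes; all are hypothetical, not measurements from Tianhe-1.
    T_C = 10.0        # "more than ten minutes" in the text
    T_RC = 10.0       # assumed; the text's value for T_rc is truncated
    MTBF = 4 * 60.0   # "several hours" in the text

    tau = young_interval(T_C, MTBF)
    loss = expected_overhead(tau, T_C, T_RC, MTBF)
    print(f"checkpoint interval ~ {tau:.1f} min, estimated overhead ~ {loss:.0%}")
```

Under these assumed values the model suggests that a substantial share of machine time goes to fault tolerance rather than computation, which is the overhead problem the method sets out to reduce. The model ignores faults during checkpointing or recovery, so it should be read as an order-of-magnitude illustration only.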