A Fault-Tolerant Method for Checkpoint-Based Computers

A checkpoint-based fault-tolerance technology for the computer field. It addresses the problems that existing approaches cannot meet the parallel file system I/O bandwidth required for fast rollback, that parent and child processes compete for computing and memory resources, and that idle resources are not used effectively. Its effects are fewer I/O accesses and lower bandwidth requirements, faster rollback, and better resource utilization.

Active Publication Date: 2020-01-21
INST OF COMPUTING TECH CHINESE ACAD OF SCI +1
Cites: 2; Cited by: 0

AI Technical Summary

Problems solved by technology

The existing approach uses the page size as the block size of the process state data. The page size therefore fixes the granularity at which modifications to the process state data are detected, and a checkpoint based on the page protection mechanism requires support from the operating system and hardware. A checkpoint that uses the operating system's copy-on-write technology also causes the parent and child processes to compete for computing and memory resources.
The existing approach does not use the computing resources that sit idle while the checkpoint is executed, and it does not make effective use of the I/O bandwidth of the parallel file system. As a result, it cannot meet the parallel file system I/O bandwidth requirements for rolling back quickly when an error occurs.

Examples

Embodiment Construction

[0026] In order to make the objectives, technical solutions, design methods and advantages of the present invention clearer, the present invention will be further described in detail below through specific embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

[0027] Figure 1 shows a flow chart of performing a checkpoint according to one embodiment of the present invention. In short, the checkpoint mechanism sets a checkpoint at an appropriate time while the process is running normally and saves the process state data (the checkpoint file) to stable storage. If a failure occurs later during execution, the saved process state data is read back from storage to roll back / restore the process. The specific steps are as follows:

[0028] Step 101: At the checkpoint moment, ...
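
The general checkpoint and rollback mechanism described in [0027] can be illustrated with a minimal sketch. It is not the patented method: the toy state dictionary, the function names `save_checkpoint`, `load_checkpoint` and `run`, and the simulated failure via `fail_at` are all assumptions made for illustration. The state is written to a file standing in for stable storage and read back after a failure.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write the state to storage atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # the old checkpoint stays valid until this point


def load_checkpoint(path):
    """Read the most recently saved state back from storage."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, checkpoint_every, path, fail_at=None):
    """Sum 0..total_steps-1, checkpointing periodically.

    fail_at (hypothetical) simulates a single failure at that step,
    after which the process rolls back to the last checkpoint.
    """
    state = {"step": 0, "acc": 0}
    failed_once = False
    save_checkpoint(state, path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at and not failed_once:
            failed_once = True
            state = load_checkpoint(path)  # rollback / restore
            continue
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

With a checkpoint every 5 steps and a simulated failure at step 7, the run resumes from the state saved at step 5 and repeats only steps 5 and 6, rather than restarting from step 0.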

Abstract

The invention provides a checkpoint-based fault-tolerance method for a computer. The method comprises the following steps: when it is decided to execute a checkpoint, suspending the user process; using idle computing resources in the computer to split the process state data of the user process into blocks and compute a hash value for each block in order to determine which blocks need to be saved; and, while the hash values of the blocks are still being computed, saving the blocks determined to need saving together with their hash values, so as to form a checkpoint file for recovering a failed user process. With this method, the idle computing resources in a supercomputer and the I/O (input/output) bandwidth of a parallel file system can be used effectively, so that both checkpoint execution time and checkpoint rollback time are shortened.
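
The block-and-hash idea in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the 4096-byte block size, the SHA-256 hash, the thread pool standing in for "idle computing resources", the in-memory `store` dictionary standing in for the checkpoint file, and every function name are hypothetical. A block is saved only if its hash differs from the hash recorded at the previous checkpoint, and saving proceeds while later blocks are still being hashed.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # assumed block size; not taken from the patent text


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the process state (a bytes object) into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_hash(block):
    """Hash one block; SHA-256 is an illustrative choice."""
    return hashlib.sha256(block).hexdigest()


def incremental_checkpoint(data, previous_hashes, store, workers=4,
                           block_size=BLOCK_SIZE):
    """Save only blocks whose hash changed since the previous checkpoint.

    Returns (hashes, saved), where saved lists the indices written to store.
    """
    blocks = split_blocks(data, block_size)
    hashes = []
    saved = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in block order while later blocks are
        # still being hashed, so saving overlaps with hashing.
        for index, digest in enumerate(pool.map(block_hash, blocks)):
            hashes.append(digest)
            is_new = index >= len(previous_hashes)
            if is_new or previous_hashes[index] != digest:
                store[index] = (digest, blocks[index])  # block plus its hash
                saved.append(index)
    return hashes, saved


def rebuild(store, num_blocks):
    """Reassemble the process state from the saved blocks for rollback."""
    return b"".join(store[i][1] for i in range(num_blocks))
```

In this sketch the first checkpoint saves every block, while a later checkpoint after a one-byte change saves only the block containing that byte. That is the source of the reduced I/O the abstract refers to.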

Description

Technical field

[0001] The present invention relates to the field of computer technology, in particular to a checkpoint-based fault-tolerant method for computers (especially supercomputers).

Background technique

[0002] With the development of information technology, the number of nodes and processors in supercomputers continues to increase, and performance has multiplied along with it. However, according to statistics, the mean time between failures (MTBF) of an entire supercomputer system has fallen to only a few hours. For example, the Tianhe-2 supercomputer in China consists of 16,000 nodes, each with 2 Ivy Bridge-E Xeon E5 2692-based processors and 3 Xeon Phi co-processors, for a total of 32,000 Ivy Bridge processors and 48,000 Xeon Phi co-processors, with a total of 3.12 million computing cores. If the MTBF of each processor in the Tianhe-2 supercomputer is 876000 hours (100 years), then the MTBF of the entire Tianhe-2 is 876000 / (48000+32...
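
The kind of estimate that [0002] starts to make can be reproduced with a short calculation. The sketch assumes that processors fail independently at a constant rate and that any single failure brings down the whole system, so the system MTBF is the per-processor MTBF divided by the processor count. The function name `system_mtbf_hours` is hypothetical; the inputs are the figures quoted in the paragraph above.

```python
def system_mtbf_hours(component_mtbf_hours, component_count):
    """MTBF of a system that fails when any one of its components fails.

    Assumes independent components with constant, identical failure rates.
    """
    return component_mtbf_hours / component_count


# Figures quoted in [0002]: 32,000 Ivy Bridge processors and
# 48,000 Xeon Phi co-processors, each with an MTBF of 876000 hours.
processors = 32_000 + 48_000
print(system_mtbf_hours(876_000, processors))  # 10.95
```

Under these assumptions the whole system would fail roughly every 11 hours, consistent with the "only a few hours" order of magnitude cited in the background.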

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F11/14
CPC: G06F11/1407
Inventor: 严明玉, 张志敏, 吴军, 龚健, 张浩, 孙凝晖
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI