System and method for cluster fault toleration

A cluster and operating system technology in the fields of response to error generation, error detection/correction, and redundant-data error detection in computation. It addresses checkpoint operation failure, declining system reliability, and the growing number of cluster system components, among other issues, while achieving good scalability.

Active Publication Date: 2009-03-04
INST OF COMPUTING TECH CHINESE ACAD OF SCI
0 Cites, 20 Cited by

AI Technical Summary

Problems solved by technology

Both of these existing technologies must run on the node where the target process is located. Their shortcoming is that the checkpoint operation cannot run when the target node fails.
The existing cooperative parallel-application checkpoint technology therefore has two disadvantages. First, checkpoint operations must be performed periodically on all processes of a parallel application, which incurs a large time overhead. Second, the checkpoint image files occupy a huge amount of storage. Meeting the checkpoint storage requirements increases the number of components in the cluster system, which raises system cost and lowers the overall reliability of the system.



Examples


Specific embodiments

[0100] As a specific implementation, the method for monitoring faults described herein includes:

[0101] (1) Judging operating system failure from the clock interrupt count: when the clock interrupt count does not increase within a predetermined time, the operating system is judged to have failed.

[0102] In common general-purpose operating systems such as Unix and Linux, functions such as process scheduling, system resource monitoring, and system time maintenance all depend on clock interrupts. The clock interrupt frequency is generally set to 100 to 1000 interrupts per second. Each operating system keeps a specific variable that counts the clock interrupts it has processed. If this count does not increase within a specified period, such as 0.05 seconds, the operating system can be judged to have failed seriously.

[0103] (2) According to the failure of calling the internal interfac...
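The clock-interrupt check in item (1) can be sketched as a small stall detector. This is a minimal illustration, not the patent's implementation: the class name `TickWatchdog`, its `observe` method, and the way samples are supplied are all assumptions, and only the 0.05-second threshold comes from the example in the text. How the counter is actually read from a node is outside the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TickWatchdog:
    """Flags an OS failure when a clock-interrupt counter stops increasing.

    Hypothetical helper for illustration; 0.05 s mirrors the example
    threshold in the text. Counter wraparound is not handled.
    """
    timeout_s: float = 0.05
    _last_count: Optional[int] = None
    _last_change: float = 0.0

    def observe(self, count: int, now: float) -> bool:
        """Record one sample; return True if the OS is judged to have failed."""
        if self._last_count is None or count > self._last_count:
            # First sample, or the counter advanced: interrupts are being serviced.
            self._last_count = count
            self._last_change = now
            return False
        # Counter did not increase: report failure once the stall reaches the timeout.
        return (now - self._last_change) >= self.timeout_s


wd = TickWatchdog()
samples = [(100, 0.00), (101, 0.01), (101, 0.03), (101, 0.07), (102, 0.08)]
verdicts = [wd.observe(count, now) for count, now in samples]
```

Passing the timestamp in explicitly keeps the check deterministic and easy to test; a real monitor would take it from a clock that keeps running when the monitored operating system does not.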


Abstract

The invention discloses a system and method for cluster fault tolerance. The system comprises a checkpoint server, a checkpoint file server, and a fault monitoring module. The checkpoint server is connected to a plurality of nodes over a network. It collects information about all processes of a parallel application, sends monitoring requests to the nodes, responds to checkpoint operation requests from the nodes, and saves checkpoint files on the checkpoint file server; after the checkpoint operation completes, it performs recovery from the checkpoints. The checkpoint file server is also connected to the nodes over a network; it stores the checkpoint files and supports access to them during process recovery. The fault monitoring module is located on each node. According to the monitoring request, it monitors the running status of the local node's operating system, the specified running status of the processes named in the request, and the specified status of the hardware components named in the request, and it sends a checkpoint operation request to the checkpoint server when a fault is detected.
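The message flow among the three components in the abstract can be modelled in memory as follows. This is a sketch under stated assumptions, not the patented design: every class, method, and field name is hypothetical, the `capture` hook merely stands in for whatever mechanism actually produces a process image, and the checkpoint file naming scheme is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CheckpointFileServer:
    """Stores checkpoint files and serves them back during recovery."""
    files: Dict[str, bytes] = field(default_factory=dict)

    def store(self, name: str, image: bytes) -> None:
        self.files[name] = image

    def load(self, name: str) -> bytes:
        return self.files[name]


@dataclass
class CheckpointServer:
    """Tracks the parallel application's processes and handles checkpoint requests."""
    file_server: CheckpointFileServer
    # Hypothetical registry: process id -> name of the node it runs on.
    processes: Dict[int, str] = field(default_factory=dict)

    def monitoring_request(self, node: str) -> List[int]:
        """Return the process ids that the given node is asked to watch."""
        return sorted(pid for pid, n in self.processes.items() if n == node)

    def handle_checkpoint_request(self, node: str, images: Dict[int, bytes]) -> List[str]:
        """Save one checkpoint file per process image; return the file names."""
        names = []
        for pid, image in sorted(images.items()):
            name = f"{node}-{pid}.ckpt"  # naming scheme is an assumption
            self.file_server.store(name, image)
            names.append(name)
        return names


@dataclass
class FaultMonitor:
    """Runs on a node; when a fault is detected, requests a checkpoint."""
    node: str
    server: CheckpointServer
    # Hypothetical hook that captures a process image for a given process id.
    capture: Callable[[int], bytes]

    def on_fault(self) -> List[str]:
        watched = self.server.monitoring_request(self.node)
        images = {pid: self.capture(pid) for pid in watched}
        return self.server.handle_checkpoint_request(self.node, images)


fs = CheckpointFileServer()
srv = CheckpointServer(fs, {11: "n1", 12: "n1", 21: "n2"})
mon = FaultMonitor("n1", srv, lambda pid: f"img{pid}".encode())
saved = mon.on_fault()
```

In this model a checkpoint is taken only when a monitor reports a fault, which reflects the abstract's contrast with checkpointing every process periodically.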

Description

Technical field

[0001] The invention relates to cluster fault tolerance, and in particular to a method and system for cluster fault tolerance based on taking and recovering process checkpoints.

Background

[0002] The cluster is the mainstream architecture of current high-performance computers. Its nodes and interconnection network usually use off-the-shelf commercial components rather than custom-made ones. The openness and scalability of this hardware platform give clusters an excellent cost-performance advantage over traditional mainframes, massively parallel processing systems (Massively Parallel Processors, MPPs), and symmetric multiprocessing systems (Symmetric MultiProcessors, SMPs). As cluster systems grow in scale and complexity, their reliability shows a downward trend. The problem of fault tolerance in cluster systems has attracted extensive attention from academia and industry. Exploring cluster fault-tolerant mechanis...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F11/00, G06F11/14
Inventor: 霍志刚
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI