Cluster fault-tolerance system, apparatus and method

A fault-tolerant system and cluster technology, applied in the fields of error detection/correction, instrumentation, electrical digital data processing, etc. It addresses the difficulty of capturing and recovering application state across cluster nodes, the decline in system reliability as clusters grow, and the failure of existing approaches to localize cluster fault handling.

Status: Inactive. Publication Date: 2009-02-18
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

However, cluster fault tolerance faces several challenges.
First, as the cluster system grows in scale, the reliability of the whole system inevitably declines as a matter of statistics.
Second, the parallelism between cluster nodes makes it more difficult to full
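
The first point above can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that nodes fail independently and that the system counts as up only when every node is up; the source does not state this model, and the per-node reliability of 0.999 is a made-up figure. The function name system_reliability is hypothetical.

```python
def system_reliability(node_reliability: float, num_nodes: int) -> float:
    """Probability that every node is up at once.

    Assumption (not from the source): node failures are independent and
    the system is considered up only if all nodes are up, so R = r ** N.
    """
    if not 0.0 <= node_reliability <= 1.0:
        raise ValueError("node_reliability must lie in [0, 1]")
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    return node_reliability ** num_nodes


if __name__ == "__main__":
    # Hypothetical per-node reliability over some fixed time window.
    r = 0.999
    for n in (10, 100, 1000, 10000):
        print(f"{n:>6} nodes -> system reliability {system_reliability(r, n):.6f}")
```

Under this model the system reliability shrinks geometrically with the node count, which is one way to read the statement that reliability declines as the cluster scales.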



Embodiment Construction

[0101] In order to make the purpose, technical solution and advantages of the present invention clearer, a cluster fault-tolerant system, device and method of the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0102] The cluster fault-tolerant system pursues the following goals: 1) Wide adaptability to applications. Cluster fault tolerance should be as independent as possible of applications and parallel application middleware, and ideally not even dependent on the node operating system, so as to ease the work of application developers and system administrators; that is, the fault tolerance mechanism should be as transparent as possible to applications. 2) Low overhead during fault-free operation of the system. The damage caused by the fault tolerance mec...


Abstract

The invention discloses a cluster fault-tolerant system, device and method. The system includes: a remote checkpoint server, which responds to remote checkpoint requests from a faulty node and executes the checkpoint operation; a node fault detection module, which monitors the operating system of the local node and the running state of designated processes and triggers a remote checkpoint; and a communication system checkpoint module, which checkpoints the communication device and supports resuming communication from the point of interruption. The invention provides localized, fast fault recovery for parallel processing clusters, has low overhead and good scalability, and gives good availability for cluster systems at the ten-billion and hundred-billion computation scale.
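
How the three components described in the abstract might interact can be sketched as a small in-process simulation. This is not the patented implementation: every class, method and field name below is hypothetical, there is no real networking or process capture, and modelling communication state as a table of unacknowledged messages is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RemoteCheckpointServer:
    """Hypothetical stand-in for the remote checkpoint server."""
    store: Dict[str, dict] = field(default_factory=dict)

    def handle_request(self, node_id: str, process_state: dict) -> None:
        # Keep a copy of the faulty node's state away from that node.
        self.store[node_id] = dict(process_state)

    def restore(self, node_id: str) -> dict:
        return dict(self.store[node_id])


@dataclass
class CommCheckpoint:
    """Hypothetical communication checkpoint: tracks unacknowledged sends."""
    next_seq: int = 0
    unacked: Dict[int, bytes] = field(default_factory=dict)

    def send(self, payload: bytes) -> int:
        seq = self.next_seq
        self.unacked[seq] = payload
        self.next_seq += 1
        return seq

    def ack(self, seq: int) -> None:
        self.unacked.pop(seq, None)

    def snapshot(self) -> dict:
        return {"next_seq": self.next_seq, "unacked": dict(self.unacked)}

    @classmethod
    def from_snapshot(cls, snap: dict) -> "CommCheckpoint":
        return cls(next_seq=snap["next_seq"], unacked=dict(snap["unacked"]))

    def replay(self) -> List[Tuple[int, bytes]]:
        # Messages that must be resent to resume from the interruption point.
        return sorted(self.unacked.items())


@dataclass
class NodeFaultDetector:
    """Hypothetical local monitor that triggers one remote checkpoint."""
    node_id: str
    server: RemoteCheckpointServer
    triggered: bool = False

    def observe(self, os_healthy: bool, process_state: dict) -> bool:
        # Trigger only on the first observed fault; return True if triggered.
        if not os_healthy and not self.triggered:
            self.server.handle_request(self.node_id, process_state)
            self.triggered = True
            return True
        return False


def demo() -> Tuple[int, List[Tuple[int, bytes]]]:
    server = RemoteCheckpointServer()
    detector = NodeFaultDetector("node-3", server)
    comm = CommCheckpoint()
    comm.send(b"halo-exchange-0")
    comm.send(b"halo-exchange-1")
    comm.ack(0)  # only the first message was acknowledged before the fault
    state = {"iteration": 42, "comm": comm.snapshot()}
    detector.observe(True, state)    # healthy: nothing happens
    detector.observe(False, state)   # fault: remote checkpoint is taken
    restored = server.restore("node-3")
    resumed = CommCheckpoint.from_snapshot(restored["comm"])
    return restored["iteration"], resumed.replay()


if __name__ == "__main__":
    print(demo())
```

In this sketch only the faulty node's state is saved and restored, which mirrors the localized recovery the abstract claims; healthy nodes are not involved in the checkpoint at all.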

Description

Technical field

[0001] The invention relates to fault tolerance for computer parallel processing, and in particular to a fault-tolerant system, device and method for parallel processing clusters.

Background art

[0002] Computer parallel processing technology, represented by clusters, is now applied both broadly and deeply in modern society. Because clusters are an important part of the social information infrastructure, the reliability of parallel processing in cluster systems already has an impact on the economy and society. As cluster systems continue to grow in scale and complexity, the reliability of their parallel processing has shown a downward trend, which has attracted widespread attention from academia and industry, and the demand for practical engineering solutions is increasingly urgent.

[0003] A cluster is a parallel computer system composed of multiple independent computers (called nodes of ...


Application Information

IPC(8): G06F11/00
Inventor: 霍志刚
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI