System and method for cluster fault toleration

A cluster and operating-system technology, applied to error response, error detection/correction, and redundant-data error detection in computation. It addresses checkpoint operation failure, declining system reliability, and the growing number of cluster system components, and it achieves good scalability.

Active Publication Date: 2009-03-04

INST OF COMPUTING TECH CHINESE ACAD OF SCI

Cites: 0; Cited by: 20

AI Technical Summary

Problems solved by technology

Both of these existing technologies must run on the node where the target process resides. Their shortcoming is that the checkpoint operation cannot run once the target node has failed.

The existing cooperative checkpoint technology for parallel applications therefore has two disadvantages. First, checkpoint operations must be performed periodically on every process of a parallel application, which incurs a large time overhead. Second, the checkpoint image files occupy a huge amount of storage. Meeting the checkpoint storage requirements increases the number of components in the cluster system, which raises system cost while lowering the overall reliability of the system.

Specific embodiments

[0100] As a specific implementation, the fault monitoring method described herein includes:

[0101] (1) Detecting operating system failure from the clock interrupt count: when the clock interrupt count does not increase within a predetermined time, the operating system is judged to have failed.

[0102] In common general-purpose operating systems such as Unix and Linux, functions such as process scheduling, system resource monitoring, and system time maintenance all depend on clock interrupts. The clock interrupt frequency is generally set to between 100 and 1000 interrupts per second. Each operating system keeps a specific variable that counts the clock interrupts it has processed. If this count does not increase within a specified period, such as 0.05 seconds, the operating system can be judged to have seriously malfunctioned.
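The check in [0101] and [0102] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `os_appears_hung`, the `read_tick_count` callback, and the injectable `sleep` parameter are all hypothetical, and the excerpt does not say where the monitor runs or how it reads the kernel's counter. The 0.05-second default is the example window given in the text. Counter wraparound is ignored for simplicity.

```python
import time
from typing import Callable


def os_appears_hung(
    read_tick_count: Callable[[], int],
    window_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the clock interrupt count did not increase during the window.

    read_tick_count is a hypothetical accessor for the OS's clock interrupt
    counter; how that counter is actually read is an assumption left open here.
    """
    before = read_tick_count()
    sleep(window_s)  # wait for the predetermined time
    after = read_tick_count()
    # "Does not increase" is taken literally; wraparound is not handled.
    return after <= before
```

At the lowest frequency mentioned (100 interrupts per second) a healthy system would process about 5 interrupts in a 0.05-second window, so a count that stays flat over that window is a strong signal. Note that a monitor scheduled by the same operating system would stop running together with it, so this check only makes sense if the monitor can still execute when the OS is hung; that placement is an assumption of the sketch.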

[0103] (2) According to the failure of calling the internal interfac...

Abstract

The invention discloses a system and method for cluster fault tolerance. The system comprises a checkpoint server, a checkpoint file server, and a fault monitoring module. The checkpoint server is connected to a plurality of nodes over a network; it collects information on all processes of a parallel application, sends monitoring requests to the nodes, responds to checkpoint operation requests from the nodes, and saves checkpoint files on the checkpoint file server. Once the checkpoint cut operation completes, recovery from the checkpoint is performed. The checkpoint file server is connected to the nodes over a network; it stores the checkpoint files and supports access to them during process recovery. The fault monitoring module is deployed on the nodes. According to the monitoring request, it monitors the running state of the local node's operating system, the running state of the processes specified in the request, and the state of the hardware components specified in the request, and it sends a checkpoint operation request to the checkpoint server when a fault is detected.
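The interaction among the three components named in the abstract can be sketched as a small in-memory simulation. Everything here is a hypothetical illustration: the class and method names, the use of a process ID as the file key, and the representation of a checkpoint image as bytes are assumptions, and real components would communicate over the network rather than through direct method calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CheckpointFileServer:
    """Stores checkpoint files and serves them during process recovery."""
    files: Dict[int, bytes] = field(default_factory=dict)

    def store(self, pid: int, image: bytes) -> None:
        self.files[pid] = image

    def load(self, pid: int) -> bytes:
        return self.files[pid]


@dataclass
class CheckpointServer:
    """Responds to checkpoint requests, saves the file, then triggers recovery."""
    file_server: CheckpointFileServer
    recovered: List[Tuple[int, bytes]] = field(default_factory=list)

    def handle_checkpoint_request(self, node: str, pid: int, image: bytes) -> None:
        self.file_server.store(pid, image)  # checkpoint cut: save the image
        # After the cut completes, recover the process from the saved file.
        self.recovered.append((pid, self.file_server.load(pid)))


@dataclass
class FaultMonitor:
    """Runs on a node; requests a checkpoint only when a fault is observed."""
    node: str
    server: CheckpointServer

    def report(self, pid: int, healthy: bool, image: bytes) -> bool:
        if healthy:
            return False  # no fault, so no checkpoint operation is requested
        self.server.handle_checkpoint_request(self.node, pid, image)
        return True


file_server = CheckpointFileServer()
server = CheckpointServer(file_server)
monitor = FaultMonitor("node-1", server)
monitor.report(pid=42, healthy=True, image=b"state-a")
monitor.report(pid=42, healthy=False, image=b"state-b")
```

The point of the sketch is the trigger: a checkpoint is requested when the monitor observes a fault, rather than periodically for every process, which is the overhead the problem statement attributes to existing cooperative checkpointing.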

Description

Technical field

[0001] The invention relates to cluster fault tolerance, in particular to a method and system for cluster fault tolerance based on process checkpoint cut and recovery.

Background

[0002] The cluster is the mainstream architecture of current high-performance computers, and its nodes and interconnection network usually use off-the-shelf commercial components rather than custom-made ones. The openness and scalability of this hardware platform give the cluster an excellent cost-performance advantage over traditional mainframes, massively parallel processing systems (Massively Parallel Processors, MPPs), and symmetric multiprocessing systems (Symmetric MultiProcessors, SMPs). As cluster systems continue to grow in scale and complexity, their reliability shows a downward trend. The problem of fault tolerance in cluster systems has attracted extensive attention from academia and industry. Exploring cluster fault-tolerant mechanis...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F11/00, G06F11/14

Inventor: 霍志刚

Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI
