Parallel communication library state self-recovery method for process failure faults

A self-recovery technique for parallel communication library state, classified under error response generation and error detection by hardware redundancy. It addresses the fact that existing solutions do not cover a state recovery strategy for parallel communication libraries, and it achieves strong fault tolerance and high computational efficiency.

Active Publication Date: 2014-03-19
NAT UNIV OF DEFENSE TECH
4 Cites, 0 Cited by
AI Technical Summary

Problems solved by technology

The prior technical solution does not cover the state recovery strategy of the parallel communication library after an error.
[0007] In summary, current patents and literature report no method for automatically recovering the state of the parallel communication library after a process failure error on a high-performance parallel computer. This is a technical problem that developers of high-performance parallel programs and administrators of high-performance computers urgently need solved.

Examples

Embodiment Construction

[0028] This embodiment addresses the case where a process of a parallel program running on a parallel computer fails and exits due to an error. It proposes a parallel communication library state self-recovery method for process failure faults. The method autonomously recovers the communication state of a parallel program composed of multiple concurrent processes after a failure, so that the program continues to run after the error occurs. For convenience, a process of the parallel program is referred to below as a calculation process.

[0029] As shown in Figure 2, the implementation steps of the parallel communication library state self-recovery method for process failure faults in this embodiment are as follows:

[0030] 1) Start the job management process and node management process; users submit parallel tasks to the job management process, and the job management process allocates computing nodes according...

Abstract

The invention discloses a parallel communication library state self-recovery method for process failure faults. The method comprises the following implementation steps: calculation processes, spawned by the node management processes under the job management process, execute non-communication local computation; the job management process monitors for process failures, and when one occurs, the node management process sends a failure message to the calculation processes; each calculation process queries the numbers of the calculation processes that failed this time, as recorded in shared memory, together with the failed-process list in the global communicator, and executes the fault recovery operation for the process failure; the failed parallel program is thereby restored to a consistent state. With this method, a parallel program does not abort or exit when it encounters a process failure, the job management system does not need to reload the whole failed parallel program, and recovery from the failed calculation process is automatic. The method offers strong fault tolerance and high computational efficiency.
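The recovery flow in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the patented implementation: the names `SharedMemory`, `GlobalCommunicator`, `node_manager_notify`, `recover`, and `simulate` are hypothetical, no real message-passing library is used, and the actual recovery operation (for example, handling pending messages to failed processes) is reduced to a comment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Hypothetical shared region written by the node management process."""
    newly_failed: list[int] = field(default_factory=list)


@dataclass
class GlobalCommunicator:
    """Hypothetical stand-in for one process's view of the global communicator."""
    size: int
    failed: set[int] = field(default_factory=set)  # the failed-process list


def node_manager_notify(shm: SharedMemory, failed_ranks: list[int]) -> None:
    """Node management process: publish the ranks that failed this time."""
    shm.newly_failed = list(failed_ranks)


def recover(comm: GlobalCommunicator, shm: SharedMemory) -> list[int]:
    """Calculation process: compare shared memory with the failed-process list.

    Returns the ranks that were not yet known to this process.
    """
    unseen = [r for r in shm.newly_failed if r not in comm.failed]
    comm.failed.update(unseen)
    # A real library would perform its fault recovery operation here;
    # this sketch only updates the bookkeeping.
    return unseen


def simulate(size: int, failed_ranks: list[int]) -> dict[int, list[int]]:
    """Run notification and recovery, returning each survivor's failed list."""
    comms = {rank: GlobalCommunicator(size) for rank in range(size)}
    shm = SharedMemory()
    node_manager_notify(shm, failed_ranks)
    for rank, comm in comms.items():
        if rank not in failed_ranks:
            recover(comm, shm)
    return {
        rank: sorted(comm.failed)
        for rank, comm in comms.items()
        if rank not in failed_ranks
    }


if __name__ == "__main__":
    # Every surviving rank should end with the same failed-process list.
    print(simulate(4, [2]))
```

In this sketch, "consistent state" means only that every surviving process ends up with the same failed-process list; calling `recover` a second time with the same shared-memory contents changes nothing, which is one simple way to make repeated notifications harmless.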

Description

Technical field

[0001] The invention relates to the technical field of parallel computing, and in particular to a parallel communication library state self-recovery method for parallel programs that use the message-passing programming model, applied after a process failure error occurs while the parallel program is running.

Background

[0002] In recent years, as high-performance computing has developed and spread, parallel computers have come into ever wider use and have also grown ever larger in scale. As the scale grows, the Mean Time Between Failures (MTBF) of parallel computer systems becomes shorter, so the probability that a parallel program (or parallel application, or parallel task) fails during execution keeps rising.

[0003] The parallel computer contains multiple computing nodes. The resource management system of ...
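The MTBF trend described in paragraph [0002] can be made concrete with a standard reliability estimate. The model below is an assumption added for illustration and does not come from the patent text: it supposes N identical nodes that fail independently at a constant rate, and a system that counts as failed as soon as any one node fails. The numbers are illustrative only.

```latex
% Assumption: each node fails independently at constant rate \lambda,
% so \mathrm{MTBF}_{\text{node}} = 1/\lambda.
% The first failure among N such nodes occurs at rate N\lambda, hence
\[
  \mathrm{MTBF}_{\text{system}}
    = \frac{1}{N\lambda}
    = \frac{\mathrm{MTBF}_{\text{node}}}{N}.
\]
% Illustrative numbers (not from the source):
% \mathrm{MTBF}_{\text{node}} = 5\ \text{years} \approx 43\,800\ \text{h},
% N = 10\,000 \;\Rightarrow\; \mathrm{MTBF}_{\text{system}} \approx 4.4\ \text{h}.
```

Under these assumptions the system MTBF shrinks in inverse proportion to the node count.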

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F11/20
Inventor: 廖湘科, 卢宇彤, 谢旻, 所光, 曹宏嘉, 蒋艳凰, 董勇, 陈海涛
Owner: NAT UNIV OF DEFENSE TECH