Process-failure-oriented state self-recovery method for a parallel communication library
A parallel communication library state self-recovery technique, applicable to fields such as response error generation and error detection of redundant data in hardware. It addresses the problem that existing approaches do not involve a state recovery strategy for parallel communication libraries, and it achieves strong fault tolerance and high computational efficiency.
Embodiment Construction
[0028] This embodiment addresses the case in which a process of a parallel program running on a parallel computer fails and exits because of an error. It proposes a process-failure-oriented parallel communication library state self-recovery method that autonomously recovers the communication state of a parallel program composed of multiple concurrent processes after such a failure, so that the parallel program continues to run after the error occurs. For convenience of description, a process of the parallel program is referred to below as a computing process.
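The general idea of recovering communication state after a process failure can be illustrated with a minimal Python sketch. It is not the patented method: the `CommState` class, the `recover_all` function, the peer table, the epoch counter, and the assumption that the failed computing process is replaced by a restarted one at a new endpoint are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CommState:
    """Hypothetical per-process view of communication library state."""
    rank: int
    peers: dict      # rank -> endpoint address (illustrative strings)
    epoch: int = 0   # incremented on every recovery round

    def recover(self, failed_rank, new_endpoint):
        """Point the failed rank at its replacement and start a new epoch."""
        self.peers[failed_rank] = new_endpoint
        self.epoch += 1


def recover_all(states, failed_rank, new_endpoint):
    """Update every surviving process, then rebuild the failed rank's state.

    Assumption: the failed computing process is restarted at new_endpoint
    and copies its peer table from a survivor.
    """
    survivors = [s for s in states if s.rank != failed_rank]
    for s in survivors:
        s.recover(failed_rank, new_endpoint)
    template = survivors[0]
    restarted = CommState(failed_rank, dict(template.peers), template.epoch)
    return sorted(survivors + [restarted], key=lambda s: s.rank)


# Usage: three computing processes, rank 1 fails and is restarted elsewhere.
states = [CommState(r, {i: f"node{i}" for i in range(3)}) for r in range(3)]
recovered = recover_all(states, failed_rank=1, new_endpoint="node9")
```

After recovery, every process in this sketch agrees on the same peer table and epoch, which is the property that lets communication resume consistently.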
[0029] As shown in Figure 2, the implementation steps of the process-failure-oriented parallel communication library state self-recovery method in this embodiment are as follows:
[0030] 1) Start the job management process and node management process; users submit parallel tasks to the job management process, and the job management process allocates computing nodes according...