Fault-tolerance method of large-scale heterogeneous parallel computing

A large-scale parallel computing technology, applied in the computer field, that addresses problems such as project suspension, the lack of many-core-level fault tolerance, and the inability to automatically detect hardware failures, so as to ensure reliability and stability and reduce fault recovery time.

Active Publication Date: 2013-02-13
JIANGNAN INST OF COMPUTING TECH

AI Technical Summary

Problems solved by technology

[0003] At present, in terms of fault-tolerance and intermittent processing mechanisms at the parallel algorithm level, parallel computing software in the main application fields has developed large-scale parallel algorithms with save-and-restore functions that allow the parallel scale to vary arbitrarily, which ensures fault tolerance at the MPI ("Message Passing Interface") level. At the many-core parallel level, however, because of the particularity and complexity of architectures such as GPUs ("Graphics Processing Units") and Cell processors, few applications implement fault tolerance. During the calculation, hardware failures of the large-scale heterogeneous computer system at the processor core level cannot be detected automatically; whether the calculation is normal and reliable can only be judged from the final result. This affects the reliability and stability of large-scale heterogeneous parallel computing. Medium and large-scale many-core parallel jobs with long computing times are often suspended and must be resubmitted manually.

[0004] Taking numerical simulation of the full flow field of aerospace vehicles as an example, according to the currently available literature, existing heterogeneous many-core parallel programs only record intermediate calculation results, that is, they provide general save-and-restore functions and do not implement fault tolerance at the many-core level. During the calculation, hardware failures of the large-scale heterogeneous computer system at the processor core level cannot be detected automatically; whether the calculation is normal and reliable can only be judged from the final result.


Examples


First embodiment

[0038] Figure 1 is a schematic flowchart of the first embodiment of the fault-tolerant method for large-scale heterogeneous parallel computing of the present invention. Referring to Figure 1, the first embodiment includes the following steps:

[0039] Step S101 is executed to assign the content of the calculation array of the calculation task to the backup array of the calculation array.

[0040] Step S102 is executed to count the number of available processor cores to obtain the first number of processor cores.

[0041] Step S103 is executed: the available processor cores run the core computing module in parallel. Note that in large-scale computing projects, the core computing module is usually the part of the core loop where the computation is concentrated. The usual practice is to decompose the tasks of the core computing module and hand them to the processors to complete in parallel. Therefore, the correctness of the core computing ...
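The decomposition described in [0041] can be sketched as follows. This is a minimal Python illustration, not the patent's implementation: the names `split_range` and `run_core_module`, the thread pool, and the use of `os.cpu_count()` as a stand-in for counting available processor cores are all assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_range(n_items, n_cores):
    """Split range(n_items) into n_cores contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_cores)
    chunks, start = [], 0
    for core in range(n_cores):
        # The first `extra` cores each take one additional item.
        stop = start + base + (1 if core < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def run_core_module(data, n_cores, kernel):
    """Apply kernel to every element of data in place, one chunk per worker."""
    def work(bounds):
        start, stop = bounds
        for i in range(start, stop):
            data[i] = kernel(data[i])

    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(work, split_range(len(data), n_cores)))


compute = [1.0, 2.0, 3.0, 4.0, 5.0]
backup = list(compute)               # S101: copy the calculation array
first_count = os.cpu_count() or 1    # S102: assumed stand-in for the core count
run_core_module(compute, first_count, lambda x: x * 2.0)  # S103
```

Each worker writes only to its own index range, so the chunks do not interfere with one another, and the backup array is left untouched for a later restore.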

Second embodiment

[0055] Figure 2 is a schematic flowchart of the second embodiment of the fault-tolerant method for large-scale heterogeneous parallel computing of the present invention. Unlike the first embodiment, the second embodiment records the time spent by each processor core participating in the core computing module after confirming that the module's calculation is correct. It also shows the execution of multiple core computing modules within one time step. Referring to Figure 2, the second embodiment includes the following steps:

[0056] Step S201 is executed to assign the content of the calculation array of the calculation task to the backup array of the calculation array.

[0057] Step S202 is executed to count the number of available processor cores to obtain the first number of processor cores.

[0058] Step S203 is executed: the core computing module is run in parallel ...
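The per-core timing that distinguishes the second embodiment can be sketched as below. This is a hedged Python illustration under stated assumptions: `timed_chunk`, `run_and_time`, and `run_time_step` are hypothetical names, worker threads stand in for processor cores, and the correctness check that precedes the timing statistics in the patent is omitted here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_chunk(core_id, chunk, kernel):
    """Run kernel over one chunk; return (core_id, results, elapsed seconds)."""
    t0 = time.perf_counter()
    results = [kernel(x) for x in chunk]
    return core_id, results, time.perf_counter() - t0


def run_and_time(chunks, kernel):
    """Run one core computing module and collect the time spent per worker."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [pool.submit(timed_chunk, i, c, kernel)
                   for i, c in enumerate(chunks)]
        outcomes = [f.result() for f in futures]
    results = [r for _, r, _ in outcomes]
    core_times = {core_id: elapsed for core_id, _, elapsed in outcomes}
    return results, core_times


def run_time_step(modules, chunks):
    """Run several core computing modules in one time step, summing times."""
    totals = {i: 0.0 for i in range(len(chunks))}
    for kernel in modules:
        chunks, core_times = run_and_time(chunks, kernel)
        for core_id, elapsed in core_times.items():
            totals[core_id] += elapsed
    return chunks, totals
```

For example, `run_time_step([f, g], [[1, 2], [3]])` applies `f` and then `g` to every chunk and returns the transformed chunks together with the accumulated time for worker 0 and worker 1.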

Third embodiment

[0084] Figure 3 is a schematic flowchart of the third embodiment of the fault-tolerant method for large-scale heterogeneous parallel computing of the present invention. Unlike the second embodiment, the third embodiment performs many-core task decomposition according to the first processor core count once that count has been obtained, shows the execution of all time steps, and, after the iterative calculation of each time step completes, compiles statistics and issues early warnings on the calculation status of each processor core. Referring to Figure 3, the third embodiment includes the following steps:

[0085] Step S301 is executed to determine whether the iterative calculation of all time steps has ended. If so, end.

[0086] Otherwise, continue to execute step S302, and assign the content of the calculation array of the calculation task to the backup array of the calculation array.

[0087] Step S303 is executed to c...
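The per-time-step statistics and early warning mentioned for the third embodiment might look like the sketch below. The source does not state the warning criterion, so the rule used here (flag a core whose time exceeds a multiple of the median) is purely an assumption, as are the names `early_warnings`, `simulate`, and `step_times`.

```python
from statistics import median


def early_warnings(core_times, factor=2.0):
    """Return IDs of cores whose time exceeds factor * the median time.

    The threshold rule is an assumed criterion, not taken from the patent.
    """
    typical = median(core_times.values())
    return sorted(c for c, t in core_times.items() if t > factor * typical)


def simulate(n_steps, step_times):
    """Loop over all time steps and collect warnings after each one.

    step_times is a hypothetical callable returning {core_id: seconds}
    for the given time step.
    """
    log = []
    for step in range(n_steps):          # S301: stop when all steps are done
        core_times = step_times(step)
        log.append((step, early_warnings(core_times)))
    return log


# Core 2 is much slower than its peers in this made-up sample.
sample = {0: 1.0, 1: 1.1, 2: 5.0, 3: 0.9}
flagged = early_warnings(sample)
```

A median-based threshold is used only because it is insensitive to a single slow core; any other statistic could be substituted.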


Abstract

The invention provides a fault-tolerance method for large-scale heterogeneous parallel computing. The method processes each core computing module of each time step as follows: the contents of the computing arrays of the computing task are assigned to backup arrays, and the calculation of the core computing module is carried out. Carrying out the calculation includes: counting the available processor cores to obtain a first processor core count; having the available processor cores compute the core computing module in parallel; counting the available processor cores a second time to obtain a second processor core count; and comparing the two counts. If the two counts are inconsistent, the contents of the backup arrays are assigned back to the computing arrays and the calculation of the core computing module is carried out again, until the first processor core count is consistent with the second. With this method, computing resources can be fully used, fault recovery time is reduced, and the reliability of parallel computing is improved.
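The back-up, count, compute, recount, and restore cycle in the abstract can be sketched as follows. This is a minimal Python illustration under stated assumptions: `run_with_core_check` and `count_cores` are hypothetical names, the serial loop stands in for the parallel computation, and the `max_attempts` cap is added for the example and is not part of the described method.

```python
def run_with_core_check(compute, kernel, count_cores, max_attempts=10):
    """Run one core computing module, retrying if the core count changes.

    count_cores is a hypothetical callable returning the number of
    currently available processor cores. Returns the number of attempts.
    """
    backup = list(compute)                  # computing array -> backup array
    for attempt in range(1, max_attempts + 1):
        first = count_cores()               # first processor core count
        for i, x in enumerate(compute):     # serial stand-in for parallel work
            compute[i] = kernel(x)
        second = count_cores()              # second processor core count
        if first == second:
            return attempt                  # counts agree: accept the result
        compute[:] = backup                 # counts differ: restore and retry
    raise RuntimeError("core count did not stabilize")


# Simulated failure: one core disappears during the first attempt.
counts = iter([8, 7, 7, 7])
data = [1, 2, 3]
attempts = run_with_core_check(data, lambda x: x * 10, lambda: next(counts))
```

In this simulated run the first attempt is discarded because the count drops from 8 to 7 during the calculation, and the second attempt, started from the restored array, is accepted.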

Description

Technical field

[0001] The invention relates to the field of computers, in particular to a fault-tolerant method for large-scale heterogeneous parallel computing.

Background

[0002] Large-scale heterogeneous high-performance computer systems are an important development direction for future extreme-scale parallel computing. Compared with traditional single-core and multi-core processor computer systems, large-scale heterogeneous high-performance computer systems are based on heterogeneous processors: the number of processor cores has increased dramatically, and the system architecture and memory access methods have changed substantially. In the environment of large-scale heterogeneous computer systems, ensuring the reliability and stability of large-scale parallel computing is a key issue, and fault-tolerance and intermittent processing mechanisms at the parallel algorithm level are among the key technologies. Efficient algorithm-le...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F11/07
Inventors: 陈德训, 刘鑫, 李芳, 徐金秀
Owner: JIANGNAN INST OF COMPUTING TECH