Operation fault location method for concurrent job
Patent Information
- Authority / Receiving Office
- CN · China
- Current Assignee / Owner
- SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN
- Publication Date
- 2018-10-09
Abstract
Description
Technical field
[0001] The invention relates to a method for locating faults that occur while parallel jobs are running, and belongs to the technical field of high-performance computing.
Background art
[0002] As the scale of the problems being solved grows, massively parallel computing tasks are becoming more common. Because a high-performance computing system is large and structurally complex, and large-scale computations involve a huge number of nodes, software and hardware resource failures often occur while a job is running. These failures lead to several kinds of problems: the job cannot be submitted; the job is submitted but remains in the PEND state; the job ends abnormally after it has been submitted and run; or the job hangs. Existing system resource monitoring and management tools can report the job running status to a certain extent, including the job status and the status of, and fault information for, the system resources on which the running job depends. However, on the one...