Operation fault location method for concurrent job

A fault locating method for job operation, applied in the field of high-performance computing, that addresses the lack of an existing fault locating method and the lack of any guarantee that faults are located in a timely manner.
CN108632086A · Active · Publication Date: 2018-10-09 · SHANDONG COMP SCI CENT (NAT SUPERCOMP CENT IN JINAN)

Patent Information

Authority / Receiving Office: CN · China
Current Assignee / Owner: SHANDONG COMP SCI CENT (NAT SUPERCOMP CENT IN JINAN)
Publication Date: 2018-10-09


Abstract

The invention relates to an operation fault location method for concurrent jobs. For faults that occur while a concurrent job is running, the method first lists all the causes that can produce the fault and classifies and grades them. It then establishes a fault location analysis method based on the problem scale and the relationships among the causes, and checks the fault causes layer by layer from top to bottom. This narrows the range of fault handling and effectively addresses the difficulty and poor accuracy of fault location in high-performance computing systems.
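The layer-by-layer, top-down check described in the abstract can be illustrated with a small sketch. This is only one possible reading, not the patented procedure: it assumes the classified and graded causes form a tree, and the names `Cause`, `locate`, and the example categories are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Cause:
    """One node in a hypothetical cause tree (a category or a concrete cause)."""
    name: str
    check: Callable[[], bool]  # returns True if this cause (or category) applies
    children: List["Cause"] = field(default_factory=list)


def locate(cause: Cause, path: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
    """Walk the tree from the top; descend only into categories whose check fires.

    Returns the path to the deepest matching cause, or None if nothing matches.
    Subtrees under a category that does not apply are never examined, which is
    how the search range shrinks at each level.
    """
    if not cause.check():
        return None
    path = path + (cause.name,)
    for child in cause.children:
        found = locate(child, path)
        if found is not None:
            return found
    return path


# Hypothetical example tree; the checks are stubbed with fixed results.
tree = Cause("job fault", lambda: True, [
    Cause("hardware", lambda: False, [
        Cause("node down", lambda: True),  # never checked: parent was ruled out
    ]),
    Cause("software", lambda: True, [
        Cause("queue closed", lambda: False),
        Cause("license unavailable", lambda: True),
    ]),
])

result = locate(tree)
print(result)
```

In a real system each `check` would query monitoring or scheduler data rather than return a constant; the point of the sketch is only that ruling out a category removes every cause beneath it from consideration.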

Description

Technical field

[0001] The invention relates to a method for locating faults in parallel job operation, and belongs to the technical field of high-performance computing.

Background art

[0002] As the scale of the problems to be solved grows, massively parallel computing tasks are becoming more and more common. Because a high-performance computing system is large and structurally complex, and large-scale computations involve a huge number of nodes, failures of software and hardware resources often occur while a job is running. Typical symptoms are that the job cannot be submitted, that the job is submitted but stays in the PEND state, that the job ends abnormally after it has been submitted and run, or that the job hangs. Existing system resource monitoring and management tools can report the running status of a job to a certain extent, including the job status, the status of system resources, and fault information for the resources on which the job depends. However, on the one...
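The four symptoms named in this paragraph can be expressed as a simple first-level classification. The sketch below is an illustration under stated assumptions, not part of the patent text: `JobSnapshot`, `Symptom`, and `classify` are hypothetical names, only the PEND state comes from the source, and the `RUN` and `EXIT` labels and the `stalled` flag are assumed inputs that a monitoring tool would have to supply.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Symptom(Enum):
    SUBMIT_FAILED = "job could not be submitted"
    STUCK_PENDING = "job submitted but stays in PEND"
    ABNORMAL_EXIT = "job ended abnormally after running"
    HANG = "job hangs"


@dataclass
class JobSnapshot:
    """Hypothetical view of a job as reported by a monitoring tool."""
    submitted: bool            # did the scheduler accept the job?
    state: str                 # "PEND" is from the source; "RUN"/"EXIT" are assumed labels
    exit_code: Optional[int]   # None while the job has not finished
    stalled: bool              # assumed: no state change or progress within a caller-defined window


def classify(job: JobSnapshot) -> Optional[Symptom]:
    """Map a snapshot to one of the four symptoms, or None if the job looks healthy."""
    if not job.submitted:
        return Symptom.SUBMIT_FAILED
    if job.state == "PEND" and job.stalled:
        return Symptom.STUCK_PENDING
    if job.exit_code is not None and job.exit_code != 0:
        return Symptom.ABNORMAL_EXIT
    if job.state == "RUN" and job.stalled:
        return Symptom.HANG
    return None


example = classify(JobSnapshot(submitted=True, state="PEND", exit_code=None, stalled=True))
print(example)
```

How long a job must wait before it counts as stalled is deliberately left to the caller, since the source gives no threshold.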
