Operation fault location method for concurrent job

A fault location method for running parallel jobs, applied in the field of high-performance computing, that addresses two gaps in the prior art: the lack of a specific fault location method and the lack of any guarantee that faults are located in a timely manner.

Active Publication Date: 2018-10-09
SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN

AI Technical Summary

Problems solved by technology

[0003] To improve the reliability of parallel job operation and the ability to analyze the impact of abnormal factors, prior-art state analysis methods for parallel job operation obtain the characteristics of job operation (mainly state characteristic parameters based on qualitative information and quantitative characteristic parameters based on performance parameters) together with fault information, so that faults can be detected in time. However, the prior art does not provide a specific fault location method.
[0004] For example, Gao Jian, Yu Kang et al. proposed a message-passing-based fault detection and analysis method. This method does not classify and grade faults, and the timeliness of fault location is not guaranteed.

Examples

Embodiment 1

[0081] A method for locating a parallel job running fault, comprising the following steps:

[0082] 1) Get system information

[0083] The system information includes the job status, computing node status, network system status, file system status, and job and resource management system status. The last four are the statuses of the system resources on which job operation depends. The system information is obtained through existing system monitoring and management tools.

[0084] From the moment the user submits the job to the system until the job completes and exits, each stage has a corresponding job status. The job status transitions are shown in Figure 1.

[0085] The job status is the running status of the job program submitted by the user in the high-performance computing system; the meaning of the job status is as follows:

[0086] 1.1) PEND: The job is being scheduled...

Embodiment 2

[0142] As in the parallel job running fault location method described in Embodiment 1, further, job statuses 1.1)-1.5) are obtained through the job status query command, and job status 1.6) is obtained by monitoring the output of the job running log and job data. If the job status shows RUN but the output of job data has stopped, the job is judged to be hung;

[0143] If the job status query command produces no output, it is determined that the job management master control in the job and resource management system is faulty, and the event severity level is 1; the status of the job management master control is then checked further;
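
The two rules in [0142] and [0143] can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the names `JobObservation`, `classify`, and `stall_threshold_s`, the labels `JOB_MANAGER_FAULT` and `JOB_HUNG`, and the 600-second stall threshold are all assumptions, and the query output is assumed to have been reduced to a bare status word such as RUN or PEND.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class JobObservation:
    # Output of the job status query, assumed already reduced to a status
    # word such as "RUN" or "PEND"; None or blank if nothing came back.
    query_output: Optional[str]
    # Seconds since the job last wrote log or data output (hypothetical input).
    seconds_since_last_output: float


def classify(obs: JobObservation,
             stall_threshold_s: float = 600.0) -> Tuple[str, Optional[int]]:
    """Return (diagnosis, severity); severity is None where the text gives none."""
    # [0143]: no result from the status query means the job management
    # master control is considered faulty, with event severity level 1.
    if obs.query_output is None or not obs.query_output.strip():
        return ("JOB_MANAGER_FAULT", 1)
    status = obs.query_output.strip().upper()
    # [0142]: status RUN while job data output has stopped means a hung job.
    # The threshold for "stopped" is an assumption; the text does not give one.
    if status == "RUN" and obs.seconds_since_last_output > stall_threshold_s:
        return ("JOB_HUNG", None)
    return (status, None)
```

For instance, `classify(JobObservation("RUN", 3600.0))` reports a hung job under the assumed threshold, while an empty query result is reported as a master control fault with severity 1.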

[0144] On the system console or the user login node, use the computing node status query command cnload to obtain the computing node status, network system status, and file system status corresponding to the job;

[0145] In addition to the fault information obtained through the system monitoring management tool, the fault ...

Embodiment 3

[0147] As in the parallel job running fault location method described in Embodiment 1, further, for error messages that appear while the job is running and error messages in the logs, the causes of the faults and handling suggestions are given through an associated knowledge base. For example, the error "Exceed user avail resource quota" returned by the job is caused by the disk quota being exceeded, and "Connect tormsctld failed" is caused by a failure of the overall resource control.
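
A knowledge base lookup of this kind might be sketched as below. The two entries come from the examples in [0147]; the dictionary layout, the function name `diagnose`, and the case-insensitive substring matching are assumptions made for illustration, and handling suggestions are omitted because the text does not state them.

```python
from typing import Optional

# Error text -> cause, taken from the two examples in [0147].
# The storage format is hypothetical; the source does not describe it.
KNOWLEDGE_BASE = {
    "Exceed user avail resource quota": "disk quota exceeded",
    "Connect tormsctld failed": "failure of the overall resource control",
}


def diagnose(log_line: str) -> Optional[str]:
    """Return the known cause for an error line, or None if it is not in the base."""
    lowered = log_line.lower()
    for pattern, cause in KNOWLEDGE_BASE.items():
        # Substring match so that timestamps or prefixes around the
        # error text do not prevent a hit (an assumed matching rule).
        if pattern.lower() in lowered:
            return cause
    return None
```

A line that matches no entry returns `None`, which a caller could treat as a signal to fall back to the manual checks described in the other embodiments.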

Abstract

The invention relates to an operation fault location method for concurrent jobs. For faults that occur while a concurrent job is running, the method first lists all the causes of the faults and classifies and grades them. It then establishes a fault location analysis method based on the scale of the problem and the relationships among the causes, and checks the fault causes layer by layer from top to bottom. This narrows the scope of fault handling and effectively addresses the problems of difficult fault location and poor accuracy in high-performance computing systems.
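
The layer-by-layer, top-to-bottom check can be illustrated with a short Python sketch. It rests on assumptions: the function name `locate_fault`, the representation of each layer as a name plus a health-check callable, and the ordering of the layers in the usage example are all hypothetical; the source names the resource types but not this structure.

```python
from typing import Callable, List, Optional, Tuple

# A layer is a (name, health check) pair; the check returns True when healthy.
Layer = Tuple[str, Callable[[], bool]]


def locate_fault(layers: List[Layer]) -> Optional[str]:
    """Walk the layers from top to bottom and return the first unhealthy one."""
    for name, is_healthy in layers:
        if not is_healthy():
            # Stop at the first failing layer, which narrows the scope
            # of further fault handling to that layer.
            return name
    return None


# Usage with stub checks; real checks would query the monitoring tools.
example_layers: List[Layer] = [
    ("job and resource management system", lambda: True),
    ("computing nodes", lambda: True),
    ("network system", lambda: False),
    ("file system", lambda: True),
]
faulty_layer = locate_fault(example_layers)
```

Stopping at the first failing layer is one simple reading of "reducing the fault processing range"; a fuller version might continue into sub-checks within that layer.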

Description

Technical field

[0001] The invention relates to a method for locating faults in parallel job operation, and belongs to the technical field of high-performance computing.

Background

[0002] As the scale of the problems being solved grows, massively parallel computing tasks are becoming more and more common. Because high-performance computing systems are large and structurally complex, and large-scale computing involves a huge number of nodes, various software and hardware resource failures often occur while a job is running. These lead to problems such as the job failing to submit, the job being submitted but remaining in the PEND state, the job ending abnormally after it is submitted and run, and the job hanging. At present, existing system resource monitoring and management tools can obtain the job running status to a certain extent, including the job status, the status of the system resources on which job running depends, and fault information. However, on the one...

Application Information

IPC(8): H04L12/24, H04L12/26, G06F9/50
CPC: G06F9/5027, H04L41/06, H04L41/0677, H04L41/069, H04L43/0817
Inventors: 朱光慧, 曾云辉, 刘晓旭
Owner SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN