Anomaly Detection System for HPC Large-Scale Parallel Programs Based on Message Passing

A message-passing-based technology for detecting program exceptions, applied in the field of detection systems. It addresses problems such as poor scalability and high overhead, and reduces resource waste, time overhead, complexity, and energy consumption.

Inactive. Publication Date: 2019-03-26
凯习(北京)信息科技有限公司

AI Technical Summary

Problems solved by technology

[0010] To determine the root causes of execution failures in HPC large-scale parallel programs, the purpose of the present invention is to provide a message-passing-based detection system that automatically monitors program exceptions and discriminates between software and hardware causes. By monitoring message transmission with a passive heartbeat mechanism, the system raises automatic alarms for abnormalities and triggers suspicious event location during the execution of HPC large-scale parallel programs. On the one hand, it overcomes problems of existing approaches such as high overhead and poor scalability; on the other hand, it realizes automatic alarm and detection of abnormalities during the execution of HPC large-scale parallel programs, accurately locates hardware faults, and provides the most likely candidates for the root cause of software errors.


Examples


Embodiment 1

[0159] Most high-performance computing applications use message passing for inter-process communication. Such programs run at large scale and for a long time, and message passing occurs throughout their execution. The present invention monitors message passing behavior for abnormalities through a simplified heartbeat mechanism. Once the configured suspicious event threshold is reached, the nodes in the HPC system are polled and checked. On the one hand, this detects abnormalities during program execution in a timely manner. On the other hand, it answers whether an execution abnormality or failure is caused by software or by hardware, a question that plagues development, debugging, and management personnel. Users therefore do not have to spend excessive effort determining the source of the problem, and system maintenance and software debugging can be more targeted.
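The threshold-then-poll step described in this embodiment can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: the function `locate_cause`, the `probe_node` callback standing in for a node health check, and the decision rule itself (any node that fails the probe is reported as a hardware fault; if all nodes respond, the processes that raised heartbeats are returned as software candidates).

```python
from typing import Callable, Dict, List, Sequence


def locate_cause(
    heartbeat_ranks: Sequence[int],
    nodes: Sequence[str],
    probe_node: Callable[[str], bool],
    threshold: int,
) -> Dict[str, object]:
    """Poll nodes only after enough suspicious events have accumulated.

    heartbeat_ranks: ranks of processes that sent an alarm heartbeat.
    probe_node: hypothetical health check, returns True if the node responds.
    threshold: assumed number of heartbeats that triggers polling.
    """
    # Below the threshold nothing is polled, so the normal case stays cheap.
    if len(heartbeat_ranks) < threshold:
        return {"status": "below_threshold", "hardware": [], "software_candidates": []}

    # Threshold reached: poll every node once.
    failed_nodes: List[str] = [n for n in nodes if not probe_node(n)]
    if failed_nodes:
        # Assumed rule: an unresponsive node points to a hardware cause.
        return {"status": "hardware", "hardware": failed_nodes, "software_candidates": []}

    # All nodes healthy: report the alarming processes as software candidates.
    candidates = sorted(set(heartbeat_ranks))
    return {"status": "software", "hardware": [], "software_candidates": candidates}
```

In this sketch the cost of polling is paid only after the threshold is reached, which mirrors the idea that state inspection happens only when necessary.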

[0160] See Figure 11. In the Linpack performance...


Abstract

The invention discloses a message-passing-based detection system that automatically monitors exceptions in HPC large-scale parallel programs and judges whether their cause is software or hardware. The system overcomes the high performance cost and poor scalability of centralized detection mechanisms. It monitors message passing behavior for abnormalities: a passive heartbeat mechanism sets a message monitoring timer for the working process on each node, and a heartbeat message is sent to the main control node only when the message behavior becomes abnormal. Under normal conditions no heartbeats are sent. This avoids occupying network resources and does not limit scalability, and a suspicious event location mechanism ensures that state inspection is performed only when necessary. According to the invention, the performance overhead added to the execution of an MPI program is negligible, and the system scales easily and supports judging the software or hardware causes of runtime errors in large-scale parallel applications on high-performance computers, at both the running and debugging stages.
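As an illustration of the passive heartbeat idea, the following minimal sketch keeps one message monitoring timer per working process and emits a heartbeat only when no message has been observed within a timeout. The class name `PassiveHeartbeatMonitor`, the `send_heartbeat` callback, the timeout value, and the use of explicit timestamps in place of a real timer and real MPI traffic are all assumptions made for the example, not details from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PassiveHeartbeatMonitor:
    """Per-process message timer that stays silent while messages flow."""

    rank: int
    timeout: float  # assumed maximum silence before raising an alarm
    send_heartbeat: Callable[[int, float], None]  # hypothetical link to the master
    last_message_time: float = 0.0
    alarmed: bool = False

    def on_message(self, now: float) -> None:
        # Every observed send or receive resets the timer; nothing is transmitted.
        self.last_message_time = now
        self.alarmed = False

    def check(self, now: float) -> bool:
        # Called periodically; sends one heartbeat per silent period.
        silent_for = now - self.last_message_time
        if silent_for > self.timeout and not self.alarmed:
            self.send_heartbeat(self.rank, silent_for)
            self.alarmed = True
            return True
        return False


def demo() -> List[Tuple[int, float]]:
    sent: List[Tuple[int, float]] = []
    monitor = PassiveHeartbeatMonitor(
        rank=3, timeout=5.0, send_heartbeat=lambda r, s: sent.append((r, s))
    )
    monitor.on_message(now=0.0)
    monitor.check(now=2.0)   # 2 s of silence: no heartbeat
    monitor.on_message(now=4.0)
    monitor.check(now=8.0)   # 4 s of silence: no heartbeat
    monitor.check(now=10.0)  # 6 s of silence: heartbeat sent
    monitor.check(now=12.0)  # already alarmed: no duplicate
    return sent
```

Because the monitor transmits nothing while messages keep arriving, the network cost in the normal case is zero, which is the property the abstract attributes to the passive mechanism.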

Description

Technical field

[0001] The invention relates to an anomaly detector for HPC large-scale parallel programs, and to a detection system that judges whether the failure of such a program is caused by software or by hardware. More specifically, it is a detection system that uses a passive heartbeat mechanism to automatically trigger abnormality alarms based on message passing, and that detects and judges the software or hardware cause of the abnormality through a suspicious event location mechanism.

Background art

[0002] High-performance computing (HPC) systems are large in scale, complex in structure, and powerful in computation. From understanding the protein folding process to predicting short-term and long-term climate patterns, massively parallel HPC simulations have become the tool of choice. These applications can run detailed numerical simulations to model the real world and enable breakthroughs in science and engineerin...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F11/30, G06F11/32, G06F11/07
CPC: G06F11/0715, G06F11/0757, G06F11/079, G06F11/3017, G06F11/302, G06F11/3055, G06F11/3065, G06F11/327, G06F2201/865
Inventors: 刘轶, 张国振
Owner: 凯习(北京)信息科技有限公司