
A node failure prediction method for large-scale cluster systems

A fault prediction technology for cluster systems, applied in fields such as information technology support systems, character and pattern recognition, and biological neural network models. It offers high accuracy and strong adaptability.

Active Publication Date: 2022-07-12
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

Although predicting a machine's next failure is a feasible and promising way to improve data center reliability, it raises two main challenges. The first is that predictions must be highly accurate, in particular to reduce false positives.
The second challenge is choosing an appropriate lead time.
If the lead time is too long, the salient features that appear shortly before a failure cannot be fully exploited, which lowers model accuracy. If the lead time is too short, prediction accuracy increases, but the administrator does not have enough time to take the relevant action on the affected nodes to avoid the failure.
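The first challenge can be made concrete with two standard metrics. The following is a minimal sketch, assuming failure prediction is evaluated as binary classification; the function names and the example counts are illustrative and do not come from the patent.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of predicted failures that were real failures."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of healthy cases that were wrongly flagged as failing."""
    healthy = false_positives + true_negatives
    return false_positives / healthy if healthy else 0.0


# Illustrative counts only: 8 correct alarms, 2 false alarms, 98 correct "healthy" calls.
example_precision = precision(8, 2)
example_fpr = false_positive_rate(2, 98)
```

A false alarm triggers unnecessary work on a healthy node, which is why a predictor of this kind is usually judged on precision and false positive rate as well as on overall accuracy.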




Embodiment Construction

[0030] Referring to Figure 4, the present invention is a node failure prediction method for large-scale cluster systems. First, the resource occupancy data of each node is collected and processed to generate a data set. A long short-term memory (LSTM) network is used to build the first model, a data prediction model, and a random forest is used to build the second model, a fault prediction model. The data of the first observation window is then assembled, and the method checks whether the size of the first observation window equals 3 hours; if not, the window is rebuilt. If it does, the first data prediction model predicts the data in the advance time window, and the first observation window is combined with the data in the advance time window to form a second observation window. The method then checks whether the size of the second observation window equals 4 hours; if not, the second observation window is rebuilt. If it does, the second fault prediction model predicts failures within the prediction window.
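The window logic in paragraph [0030] can be sketched as follows. This is a minimal illustration, not the patented implementation: the 5-minute sampling interval is an assumption, the 1-hour advance time window is inferred from the 3-hour and 4-hour window sizes, and `persistence_forecast` and `threshold_classifier` are hypothetical stand-ins for the LSTM and random forest models.

```python
from typing import Callable, List, Optional

SAMPLE_MINUTES = 5                           # assumed sampling interval, not given in the source
FIRST_WINDOW = 3 * 60 // SAMPLE_MINUTES      # 3-hour first observation window (36 samples)
SECOND_WINDOW = 4 * 60 // SAMPLE_MINUTES     # 4-hour second observation window (48 samples)
LEAD_WINDOW = SECOND_WINDOW - FIRST_WINDOW   # inferred 1-hour advance time window (12 samples)

Forecaster = Callable[[List[float], int], List[float]]
Classifier = Callable[[List[float]], bool]


def persistence_forecast(history: List[float], steps: int) -> List[float]:
    """Hypothetical stand-in for the LSTM data prediction model: repeat the last value."""
    return [history[-1]] * steps


def threshold_classifier(window: List[float]) -> bool:
    """Hypothetical stand-in for the random forest: flag a failure on high mean usage."""
    return sum(window) / len(window) > 0.9


def predict_failure(
    samples: List[float],
    forecaster: Forecaster = persistence_forecast,
    classifier: Classifier = threshold_classifier,
) -> Optional[bool]:
    """Return the failure prediction, or None if a window must be rebuilt."""
    if len(samples) != FIRST_WINDOW:
        return None                          # first observation window is not 3 hours
    predicted = forecaster(samples, LEAD_WINDOW)
    second_window = samples + predicted      # observed data followed by forecast data
    if len(second_window) != SECOND_WINDOW:
        return None                          # second observation window is not 4 hours
    return classifier(second_window)
```

For example, `predict_failure([0.95] * 36)` passes both size checks and returns the classifier's verdict on the combined 48-sample window, while a shorter input returns `None` to signal that the first observation window should be rebuilt. In a real system the two stand-ins would be replaced by trained models operating on multivariate resource data.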

[0031]...


Abstract

The invention discloses a node failure prediction method for a large-scale cluster system. The method collects resource occupancy data from each node and generates a data set, uses a long short-term memory network to build a first data prediction model, and uses a random forest to build a second fault prediction model. It establishes a first observation window and checks its size; if the size does not meet the set value, the window is rebuilt. If it does, the first data prediction model predicts the data in the advance time window, and the first observation window and the data in the advance time window are combined to form a second observation window. The size of the second observation window is then checked; if it does not meet the set value, the second observation window is rebuilt. If it does, the second fault prediction model predicts failures within the prediction window. While ensuring sufficient advance time to deal with a node failure, the invention maximizes the accuracy of the prediction model.

Description

Technical field

[0001] The invention belongs to the technical field of computer system reliability and availability, and in particular relates to a node failure prediction method for a large-scale cluster system.

Background

[0002] Cluster systems are common platforms used in high-performance computing, cloud computing, and data centers. As these platforms continue to grow in size and complexity, system reliability becomes a major concern, because the system's mean time between failures (MTBF) decreases as the number of system components increases. Recent research results show that the reliability of existing data center and cloud computing systems is limited by a mean time between failures of 10-100 hours. A data center usually has a high failure rate because it has many servers and components. Additionally, long-running applications and intensive workloads are common in these facilities. The performance of the system depends on the availability of the machines, wh...
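The background's point that MTBF shrinks as components are added can be illustrated with a textbook model. The sketch below assumes identical, independent components with constant failure rates, and a system that counts as failed when any one component fails; these assumptions, the function name, and the example figures are illustrative and are not taken from the patent.

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a series system of identical, independent components.

    With a constant per-component failure rate 1 / component_mtbf_hours,
    the system failure rate is n_components times higher, so the system
    MTBF is the component MTBF divided by the component count.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


# Illustrative figures only: components with a 100,000-hour MTBF.
small_cluster = system_mtbf_hours(100_000.0, 1_000)    # 100.0 hours
large_cluster = system_mtbf_hours(100_000.0, 10_000)   # 10.0 hours
```

Under these assumptions, a tenfold increase in component count cuts the system MTBF by a factor of ten, which is the scaling effect the background describes.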


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06K9/62, G06V10/764, G06V10/82, G06N3/04
CPC: G06N3/044, G06N3/045, G06F18/24323, Y04S10/50
Inventor: 伍卫国, 毛海, 聂世强, 张驰, 董小社, 张兴军
Owner: XI AN JIAOTONG UNIV