A Method of Writing Checkpoint Data in Massively Parallel Systems Based on Random Latency Alleviating I/O Bottleneck

A random-delay and system-check technique, applied in the field of high-performance computing, that addresses problems such as loss of checkpoint data, unrecoverable node-related processes, and impact on the I/O system, with the effects of improving scalability, reducing performance loss, and flattening write peaks.

Inactive Publication Date: 2020-07-17
凯习(北京)信息科技有限公司

AI Technical Summary

Problems solved by technology

After checkpoint data has been collected, the checkpoint software by default writes it directly to the external storage system to guard against possible node failures (for example, if a node goes down before its checkpoint data has been written to stable storage, that node's checkpoint data is lost and the processes running on it cannot be restored).
Because the number of I/O nodes in the system is far smaller than the number of computing nodes, centralized writing of checkpoint data by a large number of computing nodes impacts the I/O system and forms a system bottleneck, which becomes more prominent as the system scale increases.


Examples

Embodiment 1

[0075] As shown in Figure 3, assume that the total bandwidth of the I/O subsystem is 100 GB/s and that a total of 16,000 nodes write checkpoint files. If each node writes 10 MB of checkpoint data (160 GB in total), then in an ideal environment without I/O conflicts, finishing the write within 1 s requires 160 GB/s; with a delay of 5 s, an average of 32 GB/s must be written over those 5 s; with a delay of 10 s, only 16 GB/s must be written per second, and occupancy drops below the total system bandwidth. The actual write time, however, is longer than the theoretical time, because a large number of simultaneous writes cause conflicts and reduce I/O efficiency; the degree of such conflict decreases as the delay time increases.
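The arithmetic in this paragraph can be reproduced with a short sketch. The function name and the decimal convention (1 GB = 1000 MB) are assumptions made for illustration; the node count, per-node size, and delay windows are the ones given above.

```python
def required_bandwidth_gb_s(nodes: int, mb_per_node: float, window_s: float) -> float:
    """Average bandwidth (GB/s) needed to write all checkpoint data within window_s.

    Assumes an ideal environment with no I/O conflicts and 1 GB = 1000 MB.
    """
    total_gb = nodes * mb_per_node / 1000
    return total_gb / window_s


if __name__ == "__main__":
    TOTAL_BANDWIDTH_GB_S = 100  # total I/O subsystem bandwidth from the example
    for window_s in (1, 5, 10):
        need = required_bandwidth_gb_s(16_000, 10, window_s)
        fits = need <= TOTAL_BANDWIDTH_GB_S
        print(f"window={window_s}s need={need} GB/s fits={fits}")
```

Only the 1 s window exceeds the assumed 100 GB/s total; spreading the same 160 GB over 5 s or 10 s brings the average demand under it.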

[0076] In Figure 4, a delay time of 0 means that the random delayed-write method is not used. As the delay time begins to increase, the total write time shows a downward trend due to the reduction of ...

Abstract

The invention discloses a random-delay-based method for writing checkpoint data in large-scale parallel systems that relieves the I/O bottleneck. In this method, checkpoint data is temporarily cached in memory so that the main checkpoint process can return immediately, separating the write step from it and shortening the global stop time during a checkpoint operation. A random-delay checkpoint file processing method determines a preset delayed write time, which scatters write operations in time, reduces the I/O write peak at any one moment, and thereby relieves the I/O bottleneck. Before the large-scale parallel system executes the I/O operation, its associated data information is checked periodically; if the running of applications is affected, the delay is abandoned and the write is executed immediately, so that long occupation of shared resources does not disturb normal application execution; otherwise, the write proceeds at the determined delayed write time. Compared with the conventional centralized writing mode, the method reduces the pressure that applications on different system platforms place on the I/O subsystem, yielding higher throughput and shorter global blocking time.
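The flow described in the abstract might be sketched as follows. This is a minimal single-node illustration, not the patented implementation: the function and parameter names are hypothetical, the uniform delay distribution is an assumption, and `app_is_impacted` stands in for whatever system check the method actually performs.

```python
import random
import threading
import time
from typing import Callable, Tuple


def delayed_checkpoint_write(
    data: bytes,
    write_fn: Callable[[bytes], None],   # hypothetical: writes to stable storage
    max_delay_s: float,                  # hypothetical upper bound on the delay
    app_is_impacted: Callable[[], bool], # hypothetical periodic system check
    poll_interval_s: float = 0.1,
) -> Tuple[threading.Thread, float]:
    """Cache data in memory, return at once, and write after a random delay."""
    cached = bytes(data)  # in-memory copy so the caller can reuse its buffer
    delay_s = random.uniform(0.0, max_delay_s)  # assumption: uniform delay

    def worker() -> None:
        deadline = time.monotonic() + delay_s
        while True:
            remaining = deadline - time.monotonic()
            # Stop waiting when the delay has elapsed, or give up the delay
            # early if the check reports that applications are affected.
            if remaining <= 0 or app_is_impacted():
                break
            time.sleep(min(poll_interval_s, remaining))
        write_fn(cached)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread, delay_s  # the checkpoint main process continues immediately


if __name__ == "__main__":
    written = []
    t, d = delayed_checkpoint_write(b"ckpt", written.append, 0.5, lambda: False)
    t.join()
    print(f"wrote {len(written)} checkpoint(s) after a delay of at most {d:.3f}s")
```

Running the write in a separate thread is one way to let the caller return immediately; a real system would also need to handle a node failure that occurs while the data is still only in memory.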

Description

Technical field

[0001] The present invention relates to a processing method that dynamically adjusts the optimal write timing of checkpoint data in the field of high-performance computing and, more particularly, to a checkpoint data write control method that alleviates the I/O bottleneck caused by centralized writing of checkpoint data in a large-scale parallel system.

Background

[0002] Most high-performance computing adopts a large-scale parallel computing model. The hardware infrastructure of a high-performance computing system comprises three parts: computing nodes, the network interconnect, and the storage system. The computing nodes run the computing tasks, and the network interconnect connects computing nodes to one another and to the storage system. The storage system includes multiple I/O nodes and external storage devices. The I/O nodes run parallel file systems. Respo...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F3/06
CPC: G06F3/0611, G06F3/0613, G06F3/0614
Inventor: 刘轶, 孙庆峥, 朱延超
Owner: 凯习(北京)信息科技有限公司