Fine-grained low-overhead fault-tolerant system for GPGPU

A low-overhead fault-tolerance technique for GPGPU computing. It addresses the poor performance, implementation difficulty, and coarse fault-tolerance granularity of existing fault-tolerant systems by reducing system overhead, avoiding storage time overhead, and reducing the computation scale of each kernel.

Active Publication Date: 2019-08-02
HARBIN INST OF TECH

AI Technical Summary

Problems solved by technology

Software fault tolerance for GPUs is still in its infancy. Existing methods suffer from coarse fault-tolerance granularity, high error-repair cost, poor performance of the fault-tolerant system, and difficulty of implementation.

Examples

Embodiment

[0021] Referring to Figure 1, the task division module uses the stream computing model of the CUDA platform to divide the input data set into N data subsets. The computation kernel and the data transfers associated with each subset are grouped into a single stream, so that N streams execute in parallel at the system level. The functions cudaStreamCreate() and cudaMemcpyAsync() can be used to create a stream and to transfer data asynchronously within it. This approach exploits the asynchrony between GPGPU computation and CPU-GPU data transfer: computation and data transfer overlap in time, which hides the latency of data transfer and improves system performance. Because each kernel operates on a smaller amount of data, less work has to be recomputed during error correction.
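The stream-based division in [0021] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the kernel `scale_kernel`, the function `process_in_streams`, and the 256-thread block size are hypothetical choices made only for this example, while cudaStreamCreate() and cudaMemcpyAsync() are the calls named in the text.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: doubles each element of one data subset.
__global__ void scale_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Splits `total` elements into `num_streams` subsets and issues the
// host-to-device copy, the kernel, and the device-to-host copy of each
// subset into that subset's own stream.
// Assumption: h_in and h_out are pinned host buffers (cudaMallocHost);
// with pageable memory the copies may not overlap with computation.
cudaError_t process_in_streams(const float* h_in, float* h_out,
                               int total, int num_streams) {
    float *d_in = nullptr, *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_in, total * sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_out, total * sizeof(float));
    if (err != cudaSuccess) { cudaFree(d_in); return err; }

    std::vector<cudaStream_t> streams(num_streams);
    for (auto& s : streams) cudaStreamCreate(&s);

    const int chunk = (total + num_streams - 1) / num_streams;
    for (int k = 0; k < num_streams; ++k) {
        const int offset = k * chunk;
        const int n = (total - offset < chunk) ? (total - offset) : chunk;
        if (n <= 0) break;  // no data left for the remaining streams
        cudaMemcpyAsync(d_in + offset, h_in + offset, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[k]);
        scale_kernel<<<(n + 255) / 256, 256, 0, streams[k]>>>(
            d_in + offset, d_out + offset, n);
        cudaMemcpyAsync(h_out + offset, d_out + offset, n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[k]);
    }

    // Wait for all streams; also reports errors from the async work.
    err = cudaDeviceSynchronize();
    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

A caller would allocate `h_in` and `h_out` with cudaMallocHost(), fill the input, and call `process_in_streams(h_in, h_out, total, N)`. Operations within one stream run in issue order, so each subset's copy-compute-copy sequence stays correct while different subsets are free to overlap.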

[0022] Referring to Figure 2, since the execution of the...

Abstract

The invention provides a fine-grained, low-overhead fault-tolerant system for GPGPUs. The system comprises a task division module, a checkpoint backup module, a redundant execution and error detection module, and an error repair module. It provides fault tolerance for transient faults in GPU computing components and addresses the coarse fault-tolerance granularity, high error-repair cost, and poor performance of traditional software fault-tolerance methods for GPUs. The beneficial effects are as follows: thread tasks are divided, which reduces the computation scale of each kernel; only the relatively active variables need to be backed up at a checkpoint, which reduces the time and space overhead of storage; only the objects related to an error need to be recomputed during error repair, which reduces the cost of recomputation; and the asynchronous mechanism of the CPU-GPU heterogeneous system is fully exploited to hide the latency of data transfer and improve system performance.
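How redundant execution, error detection, and error repair can fit together is illustrated below with a generic duplicate-and-compare sketch. It is not taken from the patent: the names `square_kernel`, `compare_kernel`, and `run_with_redundancy`, and the retry limit, are hypothetical. The sketch also simplifies checkpointing by assuming the input buffer is read-only, so that the input itself serves as the state to restart from.

```cuda
#include <cuda_runtime.h>

// Hypothetical computation applied to one data subset.
__global__ void square_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Sets *mismatch to 1 if the two result buffers differ in any element.
// Bit patterns are compared so that identical NaN results count as equal.
__global__ void compare_kernel(const float* a, const float* b, int n,
                               int* mismatch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && __float_as_int(a[i]) != __float_as_int(b[i])) *mismatch = 1;
}

// Executes the kernel twice on one subset already resident on the device,
// compares the two results, and re-executes only this subset while they
// disagree. Returns the number of re-executions that were needed, or -1 if
// the results still disagree after max_retries or a CUDA call fails.
int run_with_redundancy(const float* d_in, float* d_out, float* d_shadow,
                        int* d_flag, int n, int max_retries) {
    const int blocks = (n + 255) / 256;
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        if (cudaMemset(d_flag, 0, sizeof(int)) != cudaSuccess) return -1;
        square_kernel<<<blocks, 256>>>(d_in, d_out, n);     // primary copy
        square_kernel<<<blocks, 256>>>(d_in, d_shadow, n);  // redundant copy
        compare_kernel<<<blocks, 256>>>(d_out, d_shadow, n, d_flag);
        int flag = 0;
        // Blocking copy on the default stream: waits for the kernels above.
        if (cudaMemcpy(&flag, d_flag, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess) return -1;
        if (flag == 0) return attempt;  // both copies agree
    }
    return -1;
}
```

Because the check is applied per subset, a detected mismatch triggers recomputation of that subset alone rather than of the whole data set, which is the benefit of fine granularity that the abstract describes.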

Description

Technical field

[0001] The invention relates to the field of computer technology, and in particular to a fine-grained, low-overhead fault-tolerant system for GPGPUs.

Background

[0002] In recent years, general-purpose graphics processing units (GPGPUs) have become increasingly popular owing to their computing power, memory access bandwidth, and improved programmability. Heterogeneous parallel computers that use GPUs for high-performance computing have been adopted by researchers across most scientific fields, including financial analysis, earthquake detection, high-energy physics, quantum chemistry, molecular dynamics, and drug design.

[0003] GPUs were initially used mainly for graphics and image processing, where applications are inherently tolerant of some faults: an error in the computed value of a single pixel does not affect the display of the entire ...

Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F11/14
CPC: G06F11/1458, G06F11/1428, Y02D10/00
Inventor: 季振洲, 郭明周, 李金宇
Owner: HARBIN INST OF TECH