GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications

Abstract:

While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging because their massive parallelism makes it difficult to achieve representativeness while remaining time-efficient. This paper makes three key contributions. First, it presents the design of a fault-injection methodology to evaluate end-to-end reliability properties of application kernels running on GPUs. Second, it introduces a fault-injection tool that uses real GPU hardware and offers a good balance between the representativeness and the efficiency of the fault-injection experiments. Third, it characterizes the error resilience of twelve GPGPU applications.

Published in: 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Date of Conference: 23-25 March 2014

Date Added to IEEE Xplore: 26 June 2014

DOI: 10.1109/ISPASS.2014.6844486

Conference Location: Monterey, CA, USA