A Systematic Methodology for Evaluating the Error Resilience of GPGPU Applications


Abstract:

The wide adoption of graphics processing units (GPUs) as accelerators for general-purpose applications makes the end-to-end reliability implications of their use increasingly significant. Fault injection is a widely adopted method to evaluate the resilience of applications. However, building a fault injector for general-purpose GPU applications is challenging due to their massive parallelism, which makes it difficult to achieve representativeness while being time-efficient. This paper makes four key contributions. First, it presents a fault-injection methodology to evaluate the end-to-end reliability properties of application kernels running on GPUs. Second, it introduces GPU-Qin, a fault-injection tool that uses real GPU hardware and offers a tunable and efficient balance between the representativeness and the cost of a fault-injection campaign. Third, it characterizes the error resilience characteristics of seventeen application kernels. Finally, it provides preliminary insights on correlations between the algorithmic properties of applications and their error resilience.
Published in: IEEE Transactions on Parallel and Distributed Systems (Volume: 27, Issue: 12, 01 December 2016)
Page(s): 3397 - 3411
Date of Publication: 07 March 2016
