Fault Tolerant Stencil Computation on Cloud-Based GPU Spot Instances


Abstract:

This paper describes a fault-tolerant framework for distributed stencil computation on cloud-based GPU clusters. It uses pipelining to overlap data movement with computation in the halo region and parallelizes data movement within the GPUs. Instead of running stencil codes on traditional clusters and supercomputers, the computation is performed on the Amazon Web Services (AWS) GPU cloud and uses spot instances to improve cost-efficiency. The implementation is based on a low-cost fault-tolerant mechanism that handles the possible termination of spot instances. Coupled with a price bidding module, our stencil framework optimizes not only for performance but also for cost. Experimental results show that our framework outperforms state-of-the-art solutions, achieving a peak of 25 TFLOPS for 2-D decomposition running on 512 nodes. We also show that the use of spot instances yields good cost-efficiency, increasing the average TFLOPS/USD from 132 to 360.
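
The overlap of halo data movement with computation mentioned in the abstract rests on a simple dependency property: rows in the interior of a subdomain do not read halo data, so they can be updated while the halo is still in transit. The following is a minimal sketch of that property only, not the paper's implementation. It assumes a 5-point Jacobi stencil and a split of the grid into two row blocks; the names `sweep_rows`, `reference_step`, and `decomposed_step` are hypothetical. The stages run sequentially here, whereas a GPU framework would presumably run stages 1 and 2 concurrently.

```python
# Illustrative sketch (not from the paper): a 2-D 5-point Jacobi stencil
# split into two row blocks, each with a single halo row.

def sweep_rows(old, new, rows):
    """Apply the 5-point average to the given rows of `old`, writing into `new`."""
    cols = len(old[0])
    for i in rows:
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (old[i - 1][j] + old[i + 1][j]
                                + old[i][j - 1] + old[i][j + 1])


def reference_step(grid):
    """One Jacobi sweep over the whole grid, used as the reference result."""
    new = [row[:] for row in grid]
    sweep_rows(grid, new, range(1, len(grid) - 1))
    return new


def decomposed_step(grid, split):
    """One Jacobi sweep with the grid split into rows [0, split) and [split, rows)."""
    rows, cols = len(grid), len(grid[0])
    assert 2 <= split <= rows - 2, "each block needs at least two rows"

    # Each block holds its own rows plus one halo row. The halo starts as
    # None, so any premature read of it would raise a TypeError.
    top = [row[:] for row in grid[:split]] + [[None] * cols]
    bottom = [[None] * cols] + [row[:] for row in grid[split:]]
    new_top = [row[:] for row in top]
    new_bottom = [row[:] for row in bottom]

    # Stage 1: interior rows never touch the halo, so this work could
    # overlap with the halo transfer.
    sweep_rows(top, new_top, range(1, split - 1))
    sweep_rows(bottom, new_bottom, range(2, len(bottom) - 1))

    # Stage 2: halo exchange (a plain copy stands in for the data movement).
    top[split] = bottom[1][:]
    bottom[0] = top[split - 1][:]

    # Stage 3: the rows adjacent to the cut need the freshly received halo.
    sweep_rows(top, new_top, [split - 1])
    sweep_rows(bottom, new_bottom, [1])

    # Drop the halo rows and stitch the blocks back together.
    return new_top[:split] + new_bottom[1:]


if __name__ == "__main__":
    grid = [[float(i * i + j) for j in range(5)] for i in range(6)]
    print(decomposed_step(grid, 3) == reference_step(grid))
```

Because stage 1 completes while the halo rows are still unset, the sketch shows that only the rows next to the cut have to wait for the exchange. The decomposed sweep produces the same grid as the single-domain sweep.
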
Published in: IEEE Transactions on Cloud Computing (Volume: 7, Issue: 4, 01 Oct.-Dec. 2019)
Page(s): 1013 - 1024
Date of Publication: 31 May 2017
