research-article

GPU snapshot: checkpoint offloading for GPU-dense systems

Authors:
Kyushick Lee (University of Texas at Austin)
Michael B. Sullivan (NVIDIA)
Siva Kumar Sastry Hari (NVIDIA)
Timothy Tsai (NVIDIA)
Stephen W. Keckler (NVIDIA)
Mattan Erez (University of Texas at Austin)

ICS '19: Proceedings of the ACM International Conference on Supercomputing, June 2019, Pages 171–183. https://doi.org/10.1145/3330345.3330361

Published: 26 June 2019

ABSTRACT

Future High-Performance Computing (HPC) systems will likely be composed of accelerator-dense heterogeneous computers because accelerators deliver higher performance at lower cost, socket count, and energy consumption. Such accelerator-dense nodes pose a reliability challenge because preserving the large amount of state held within accelerators for checkpointing incurs significant overhead. Checkpointing multiple accelerators at the same time, which is necessary to obtain a consistent coordinated checkpoint, overwhelms the host interconnect, memory, and I/O bandwidths. We propose GPU Snapshot to mitigate this issue by: (1) enabling a fast logical snapshot to be taken, while the actual checkpointed state is transferred asynchronously to alleviate bandwidth hot spots; (2) using incremental checkpoints that reduce the volume of data transferred; and (3) offloading checkpointing to the host to limit accelerator complexity and make effective use of the host. As a concrete example, we describe and evaluate the design tradeoffs of GPU Snapshot in the context of a GPU-dense multi-exascale HPC system. We demonstrate 4–40× checkpoint overhead reductions at the node level, which enables a system with GPU Snapshot to approach the performance of a system with idealized GPU checkpointing.
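As a rough illustration of point (2) only, the following Python sketch shows the general idea behind an incremental checkpoint: track which pages were written since the last checkpoint and copy only those to a host-side buffer. It is a toy software model under stated assumptions, not the mechanism evaluated in the paper; `DeviceMemory`, `incremental_checkpoint`, and the 4-byte `PAGE_SIZE` are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4  # hypothetical page size in bytes, tiny so the example is easy to trace


@dataclass
class DeviceMemory:
    """Toy stand-in for accelerator memory with per-page dirty tracking."""
    data: bytearray
    dirty: set = field(default_factory=set)

    def write(self, offset: int, payload: bytes) -> None:
        """Store payload at offset and mark every page it touches as dirty."""
        if not payload:
            return
        self.data[offset:offset + len(payload)] = payload
        first = offset // PAGE_SIZE
        last = (offset + len(payload) - 1) // PAGE_SIZE
        self.dirty.update(range(first, last + 1))


def incremental_checkpoint(mem: DeviceMemory, host_copy: bytearray) -> list:
    """Copy only pages written since the last checkpoint; return their indices."""
    pages = sorted(mem.dirty)
    for page in pages:
        start = page * PAGE_SIZE
        host_copy[start:start + PAGE_SIZE] = mem.data[start:start + PAGE_SIZE]
    mem.dirty.clear()  # the next checkpoint starts from a clean slate
    return pages


mem = DeviceMemory(bytearray(16))   # four pages of "device" state
host = bytearray(16)                # host-side checkpoint image

mem.write(5, b"ab")                 # touches page 1 only
first_pages = incremental_checkpoint(mem, host)

mem.write(7, b"xyz")                # straddles pages 1 and 2
second_pages = incremental_checkpoint(mem, host)
```

In this model the volume transferred per checkpoint scales with the number of dirty pages rather than with total memory size, which is the property the abstract attributes to incremental checkpoints.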

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Processors and memory architectures

Published in
ICS '19: Proceedings of the ACM International Conference on Supercomputing
June 2019
533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345
General Chair: Rudolf Eigenmann (University of Delaware)
Program Chairs: Chen Ding (University of Rochester), Sally A. McKee (Clemson University)
Copyright © 2019 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
GPU
fault tolerance
resilience
Acceptance Rates
Overall Acceptance Rate: 584 of 2,055 submissions, 28%