ABSTRACT
Future High-Performance Computing (HPC) systems will likely be composed of accelerator-dense heterogeneous computers, because accelerators deliver higher performance at lower cost, socket count, and energy consumption. Such accelerator-dense nodes pose a reliability challenge: preserving the large amount of state held within accelerators for checkpointing incurs significant overhead. Checkpointing multiple accelerators at the same time, which is necessary to obtain a consistent coordinated checkpoint, overwhelms the host interconnect, memory, and I/O bandwidths. We propose GPU Snapshot to mitigate this issue by: (1) enabling a fast logical snapshot to be taken, while the actual checkpointed state is transferred asynchronously to alleviate bandwidth hot spots; (2) using incremental checkpoints that reduce the volume of data transferred; and (3) offloading checkpointing to the host to limit accelerator complexity and make effective use of the host. As a concrete example, we describe and evaluate the design tradeoffs of GPU Snapshot in the context of a GPU-dense multi-exascale HPC system. We demonstrate 4--40X checkpoint overhead reductions at the node level, which enables a system with GPU Snapshot to approach the performance of a system with idealized GPU checkpointing.
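To make the idea of an incremental checkpoint concrete, the following is a minimal host-side sketch in Python. It is not the GPU Snapshot mechanism itself. It assumes a hypothetical 4 KiB page granularity and uses content hashing as a stand-in for whatever dirty-state tracking the actual design employs. The names `PAGE_SIZE`, `page_digests`, `incremental_checkpoint`, and `restore` are illustrative only.

```python
import hashlib

PAGE_SIZE = 4096  # hypothetical page granularity, chosen for illustration


def page_digests(buf, page_size=PAGE_SIZE):
    """Hash every page of buf so later checkpoints can detect changes."""
    return [hashlib.sha256(bytes(buf[i:i + page_size])).digest()
            for i in range(0, len(buf), page_size)]


def incremental_checkpoint(buf, prev_digests, page_size=PAGE_SIZE):
    """Return (dirty, new_digests).

    dirty maps page index -> page contents for every page whose digest
    differs from the previous checkpoint (or that is new). Hashing is an
    assumption of this sketch, not the paper's tracking method.
    """
    new_digests = page_digests(buf, page_size)
    dirty = {}
    for idx, digest in enumerate(new_digests):
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            start = idx * page_size
            dirty[idx] = bytes(buf[start:start + page_size])
    return dirty, new_digests


def restore(base, increments, page_size=PAGE_SIZE):
    """Rebuild state by replaying increments, oldest first, over a full base."""
    out = bytearray(base)
    for dirty in increments:
        for idx, page in dirty.items():
            start = idx * page_size
            out[start:start + len(page)] = page
    return bytes(out)
```

In this sketch, the first call with an empty digest list behaves like a full checkpoint, since every page counts as dirty. Later calls return only the pages modified since the previous call, which is the property that reduces the volume of data transferred.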
Index Terms
- GPU snapshot: checkpoint offloading for GPU-dense systems