DOI: 10.1145/3330345.3330361
Research article

GPU snapshot: checkpoint offloading for GPU-dense systems

Published: 26 June 2019

ABSTRACT

Future High-Performance Computing (HPC) systems will likely be composed of accelerator-dense heterogeneous computers because accelerators are able to deliver higher performance at lower costs, socket counts and energy consumption. Such accelerator-dense nodes pose a reliability challenge because preserving a large amount of state within accelerators for checkpointing incurs significant overhead. Checkpointing multiple accelerators at the same time, which is necessary to obtain a consistent coordinated checkpoint, overwhelms the host interconnect, memory and IO bandwidths. We propose GPU Snapshot to mitigate this issue by: (1) enabling a fast logical snapshot to be taken, while actual checkpointed state is transferred asynchronously to alleviate bandwidth hot spots; (2) using incremental checkpoints that reduce the volume of data transferred; and (3) checkpoint offloading to limit accelerator complexity and effectively utilize the host. As a concrete example, we describe and evaluate the design tradeoffs of GPU Snapshot in the context of a GPU-dense multi-exascale HPC system. We demonstrate 4x to 40x checkpoint overhead reductions at the node level, which enables a system with GPU Snapshot to approach the performance of a system with idealized GPU checkpointing.
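
The incremental-checkpoint idea in point (2) can be illustrated with a small host-side sketch. This is not GPU Snapshot's mechanism: the abstract does not say how modified state is identified, so the sketch assumes a hypothetical fixed page size and detects changed pages by comparing content hashes against the previous checkpoint. The names `PAGE_SIZE`, `page_digests`, `incremental_checkpoint`, and `restore` are made up for illustration, and the state is assumed to keep the same size between checkpoints.

```python
import hashlib

PAGE_SIZE = 4096  # hypothetical page granularity, chosen only for illustration


def page_digests(state, page_size=PAGE_SIZE):
    """Return one content digest per fixed-size page of `state`."""
    return [
        hashlib.sha256(state[off:off + page_size]).digest()
        for off in range(0, len(state), page_size)
    ]


def incremental_checkpoint(state, prev_digests, page_size=PAGE_SIZE):
    """Collect only the pages whose content changed since the last checkpoint.

    Returns (changed_pages, new_digests), where changed_pages maps a page
    index to a copy of that page's bytes.
    """
    new_digests = page_digests(state, page_size)
    changed = {}
    for idx, digest in enumerate(new_digests):
        # A page is saved if it is new or its digest differs from last time.
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            off = idx * page_size
            changed[idx] = bytes(state[off:off + page_size])
    return changed, new_digests


def restore(base, changed, page_size=PAGE_SIZE):
    """Rebuild the state by applying saved pages on top of a base image."""
    buf = bytearray(base)
    for idx, page in changed.items():
        off = idx * page_size
        buf[off:off + len(page)] = page
    return bytes(buf)


# Usage: the first checkpoint saves every page, the second only the dirty one.
state = bytearray(8 * PAGE_SIZE)
full, digests = incremental_checkpoint(state, [])
base = bytes(state)
state[5 * PAGE_SIZE + 10] = 0xFF  # modify a single byte inside page 5
delta, digests = incremental_checkpoint(state, digests)
```

In this toy run the second checkpoint holds one page instead of eight, which is the kind of reduction in transferred data that incremental checkpointing aims for. Hashing every page is itself a cost that a real system would likely avoid, for example by tracking writes directly.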

Published in: ICS '19: Proceedings of the ACM International Conference on Supercomputing, June 2019, 533 pages
ISBN: 9781450360791
DOI: 10.1145/3330345

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 584 of 2,055 submissions, 28%
