System-Level Transparent Checkpointing for OpenSHMEM

Garg, Rohan; Vienne, Jérôme; Cooperman, Gene

doi:10.1007/978-3-319-50995-2_4

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10007)

Included in the following conference series:

Workshop on OpenSHMEM and Related Technologies

Abstract

Fault tolerance is an active area of research for OpenSHMEM programs. In this work, we present the first approach using system-level transparent checkpointing. This complements an existing approach based on application-level checkpointing. Application-level checkpointing has advantages for algorithm-based fault tolerance, while transparent checkpointing can be invoked by the system at an arbitrary time. Unlike the earlier application-level work of Hao et al., this system-level approach creates checkpoint images in stable storage, thus enabling restart at a later time or even process migration. An experimental evaluation is presented using NAS NPB benchmarks for OpenSHMEM. To support this work, the design of DMTCP (Distributed MultiThreaded CheckPointing) was extended to support shared memory regions in the absence of virtual memory.

R. Garg and G. Cooperman—This work was partially supported by the National Science Foundation under Grant ACI-1440788.

References

Ali, N., Krishnamoorthy, S., Govind, N., Palmer, B.J.: A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models. IEEE Computer Society, Los Alamitos (2011)
Ansel, J., Arya, K., Cooperman, G.: DMTCP: transparent checkpointing for cluster computations and the desktop. In: IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1–12. IEEE Press (2009)
Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Schreiber, R.S., Simon, H.D., Venkatakrishnan, V., Weeratunga, S.K.: The NAS parallel benchmarks. Intl. J. Supercomput. Appl. 5(3), 63–73 (1991)
BLCR team: BLCR frequently asked questions (for version 0.8.5). https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#limitations. Accessed June 2016
Bouteiller, A., Herault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V project: a multiprotocol automatic fault tolerant MPI. Int. J. High Perform. Comput. Appl. 20, 319–333 (2006)
Bronevetsky, G., Marques, D., Pingali, K., Rugina, R., McKee, S.A.: Compiler-enhanced incremental checkpointing for OpenMP applications. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2009
Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Automated application-level checkpointing of MPI programs. In: PPoPP 2003: Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, NY, USA, pp. 84–94. ACM, New York (2003)
Cao, J., Kerr, G., Arya, K., Cooperman, G.: Transparent checkpoint-restart over InfiniBand. In: Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, pp. 13–24. ACM Press (2014)
Chapman, B., Curtis, T., Pophale, S., Poole, S., Kuehn, J., Koelbel, C., Smith, L.: Introducing OpenSHMEM: SHMEM for the PGAS community. In: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, pp. 2:1–2:3, PGAS 2010, NY, USA. ACM, New York (2010)
Duell, J., Hargrove, P., Roman, E.: The design and implementation of Berkeley lab’s Linux checkpoint/restart (BLCR). Technical report LBNL-54941, Lawrence Berkeley National Laboratory (2003)
Gao, Q., Yu, W., Huang, W., Panda, D.K.: Application-transparent checkpoint/restart for MPI programs over InfiniBand. In: ICPP 2006: Proceedings of the 2006 International Conference on Parallel Processing, pp. 471–478. IEEE Computer Society, Washington, DC (2006)
Graham, R.L., Woodall, T.S., Squyres, J.M.: Open MPI: a flexible high performance MPI. In: Proceedings of the 6th Annual International Conference on Parallel Processing and Applied Mathematics, Poznan, Poland, September 2005
Hammond, J.: OSHMPI (06 2016). https://github.com/jeffhammond/oshmpi
Hammond, J.R., Ghosh, S., Chapman, B.M.: Implementing OpenSHMEM using MPI-3 one-sided communication. In: Poole, S., Hernandez, O., Shamis, P. (eds.) OpenSHMEM 2014. LNCS, vol. 8356, pp. 44–58. Springer, Heidelberg (2014). doi:10.1007/978-3-319-05215-1_4
Hao, P., Pophale, S., Shamis, P., Curtis, T., Chapman, B.: Check-pointing approach for fault tolerance in OpenSHMEM. In: Gorentla Venkata, M., Shamis, P., Imam, N., Lopez, M.G. (eds.) OpenSHMEM 2014. LNCS, vol. 9397, pp. 36–52. Springer, Heidelberg (2015). doi:10.1007/978-3-319-26428-8_3
Hao, P., Shamis, P., Venkata, M.G., Pophale, S., Welch, A., Poole, S., Chapman, B.: Fault tolerance for OpenSHMEM. In: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014, pp. 23:1–23:3 (2014)
Hargrove, P., Duell, J.: Berkeley lab checkpoint/restart (BLCR) for Linux clusters. J. Phys. Conf. Ser. 46, 494–499 (2006)
High Performance Computing Tools Group at the University of Houston, Extreme Scale Systems Center, Oak Ridge National Laboratory: OpenSHMEM Application Programming interface (version 1.3). http://openshmem.org/site/sites/default/site_files/OpenSHMEM-1.3.pdf. Accessed June 2016
Huang, W., Santhanaraman, G., Jin, H., Gao, Q., Panda, D.: Design and Implementation of High Performance MVAPICH2: MPI2 Over InfiniBand, May 2007
Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In: Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS)/12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems. IEEE Computer Society, March 2007
Janakiraman, G., Santos, J., Subhraveti, D., Turner, Y.: Cruz: application-transparent distributed checkpoint-restart on standard operating systems. In: Dependable Systems and Networks (DSN 2005), pp. 260–269 (2005)
Jose, J., Hamidouche, K., Zhang, J., Venkatesh, A., Panda, D.: Optimizing collective communication in UPC, May 2014
Jose, J., Zhang, J., Venkatesh, A., Potluri, S., Panda, D.K.D.: A comprehensive performance evaluation of OpenSHMEM libraries on InfiniBand clusters. In: Poole, S., Hernandez, O., Shamis, P. (eds.) OpenSHMEM 2014. LNCS, vol. 8356, pp. 14–28. Springer, Heidelberg (2014). doi:10.1007/978-3-319-05215-1_2
Laadan, O., Nieh, J.: Transparent checkpoint-restart of multiple processes for commodity clusters. In: 2007 USENIX Annual Technical Conference, pp. 323–336 (2007)
Laadan, O., Phung, D., Nieh, J.: Transparent networked checkpoint-restart for commodity clusters. In: 2005 IEEE International Conference on Cluster Computing. IEEE Press (2005)
Network-Based Computing Laboratory: MVAPICH2 (06 2016). http://mvapich.cse.ohio-state.edu/
Network-Based Computing Laboratory: MVAPICH2-X (06 2016). http://mvapich.cse.ohio-state.edu/
NASA Advanced Supercomputing Division: NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html. Accessed Apr 2016
Pophale, S., Nanjegowda, R., Curtis, T., Chapman, B., Jin, H., Poole, S., Kuehn, J.: OpenSHMEM performance and potential: a NPB experimental study. In: The 6th Conference on Partitioned Global Address Space Programming Models (PGAS 2012). Citeseer (2012)
Sankaran, S., Squyres, J.M., Barrett, B., Sahay, V., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. Int. J. High Perform. Comput. Appl. 19(4), 479–493 (2005)
Sudakov, O.O., Meshcheriakov, I.S., Boyko, Y.V.: CHPOX: transparent checkpointing system for Linux clusters. In: IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, pp. 159–164 (2007). software available at http://freshmeat.net/projects/chpox/
TOP500 supercomputer sites (Jun 2016). http://top500.org/list/2016/06/
Vienne, J., Chen, J., Wasi-Ur-Rahman, M., Islam, N.S., Subramoni, H., Panda, D.K.: Performance analysis and evaluation of InfiniBand FDR and 40GigE RoCE on HPC and cloud computing systems. In: Hot Interconnects, pp. 48–55 (2012)
Wong, F.C., Martin, R.P., Arpaci-Dusseau, R.H., Culler, D.E.: Architectural requirements and scalability of the NAS parallel benchmarks. In: Supercomputing (1999)

Acknowledgment

We would like to thank both Kapil Arya and Jiajun Cao for many useful discussions on the internals of DMTCP, and the design of those internal components. We also acknowledge the support of the Texas Advanced Computing Center (TACC) and the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1053575.

Author information

Authors and Affiliations

Northeastern University, Boston, Massachusetts, 02115, USA
Rohan Garg & Gene Cooperman
Texas Advanced Computing Center, The University of Texas at Austin, Austin, Texas, 78758, USA
Jérôme Vienne

Corresponding author

Correspondence to Rohan Garg.

Editor information

Editors and Affiliations

Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Manjunath Gorentla Venkata
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Neena Imam
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Swaroop Pophale
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Tiffany M. Mintz

About this paper

Cite this paper

Garg, R., Vienne, J., Cooperman, G. (2016). System-Level Transparent Checkpointing for OpenSHMEM. In: Gorentla Venkata, M., Imam, N., Pophale, S., Mintz, T. (eds) OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments. OpenSHMEM 2016. Lecture Notes in Computer Science, vol 10007. Springer, Cham. https://doi.org/10.1007/978-3-319-50995-2_4

DOI: https://doi.org/10.1007/978-3-319-50995-2_4
Published: 15 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50994-5
Online ISBN: 978-3-319-50995-2
eBook Packages: Computer Science, Computer Science (R0)
