Fast In-Memory Checkpointing with POSIX API for Legacy Exascale-Applications

Fajerski, Jan; Noack, Matthias; Reinefeld, Alexander; Schintke, Florian; Schütt, Torsten; Steinke, Thomas

doi:10.1007/978-3-319-40528-5_19

Part of the book series: Lecture Notes in Computational Science and Engineering (LNCSE, volume 113)

Abstract

Exascale systems will be much more vulnerable to failures than today’s high-performance computers. We present a scheme that writes erasure-encoded checkpoints to other nodes’ memory. The rationale is twofold: first, writing to memory over the interconnect is several orders of magnitude faster than traditional disk-based checkpointing; second, erasure-encoded data can survive component failures. We use a distributed file system with a tmpfs back end and intercept file accesses with LD_PRELOAD. Because the scheme exposes a POSIX file system API, legacy applications that already support application-level checkpoint/restart can quickly materialize their checkpoints via the supercomputer’s interconnect without any change to their source code. Experimental results show that the LD_PRELOAD client yields 69 % better sequential bandwidth (with striping) than FUSE while remaining transparent to the application. With erasure encoding, performance is 17 % to 49 % worse than with striping because of the additional data handling and encoding effort. Even so, our results indicate that erasure-encoded in-memory checkpoint/restart is an effective means to improve resilience for exascale computing.
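
To illustrate why erasure-encoded data survives a component failure, the following minimal sketch splits a checkpoint into k data chunks plus one XOR parity chunk, as if each chunk were held in a different node's memory. The abstract does not state which erasure code the authors use, so single XOR parity is an assumption chosen for brevity; it tolerates only one lost chunk. The function names `encode` and `recover` are hypothetical and not taken from the paper.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(checkpoint: bytes, k: int) -> List[bytes]:
    """Split a checkpoint into k equal data chunks plus one XOR parity chunk.

    Assumption: single parity, the simplest possible erasure code.
    """
    chunk_len = -(-len(checkpoint) // k)  # ceiling division
    padded = checkpoint.ljust(chunk_len * k, b"\0")  # zero-pad the last chunk
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def recover(chunks: List[Optional[bytes]], size: int) -> bytes:
    """Rebuild the checkpoint when at most one chunk is missing (None)."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost chunk")
    if missing:
        # XOR of all surviving chunks equals the lost chunk.
        survivors = [c for c in chunks if c is not None]
        chunks = list(chunks)
        chunks[missing[0]] = reduce(xor_bytes, survivors)
    # Drop the parity chunk and the padding.
    return b"".join(chunks[:-1])[:size]


if __name__ == "__main__":
    data = b"simulation state at step 1000"
    stored = encode(data, k=4)   # 4 data chunks + 1 parity chunk
    stored[2] = None             # simulate the failure of one node
    assert recover(stored, len(data)) == data
```

The encoding pass over every chunk is extra work that plain striping does not need, which is consistent with the abstract's observation that erasure encoding is slower than striping. Tolerating more than one simultaneous failure would require a stronger code than this sketch.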

Notes

1. http://wiki.lustre.org/
2. The Cray XC40 ‘Konrad’ is operated at ZIB as part of the North German Supercomputer Alliance. It comprises 1872 nodes (44,928 cores), a Cray Aries network, 120 TB of main memory, and a parallel Lustre file system with 4.5 PB capacity and 52 GB/s bandwidth.
3. https://computation.llnl.gov/project/scr/
4. FUSE (Filesystem in Userspace) allows the creation of a file system without changing Linux kernel code.
5. http://wiki.lustre.org/index.php/LibLustre_How-To_Guide
6. https://www.rrz.uni-hamburg.de/services/hpc/bqcd.html
7. IOR is an I/O micro-benchmark by NERSC. https://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/ior/

Acknowledgements

We thank Johannes Dillmann who performed some of the experiments. This work was supported by the DFG SPPEXA project ‘A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing’ (FFMK) and the North German Supercomputer Alliance HLRN.

Author information

Authors and Affiliations

Zuse Institute Berlin (ZIB), Berlin, Germany
Jan Fajerski, Matthias Noack, Alexander Reinefeld, Florian Schintke, Torsten Schütt & Thomas Steinke

Corresponding author

Correspondence to Jan Fajerski.

Editor information

Editors and Affiliations

Technische Universität München Institut für Informatik, Garching, Bayern, Germany
Hans-Joachim Bungartz
Technische Universität München Institut für Informatik, Garching, Germany
Philipp Neumann
Technische Universität Dresden, Dresden, Germany
Wolfgang E. Nagel

About this paper

Cite this paper

Fajerski, J., Noack, M., Reinefeld, A., Schintke, F., Schütt, T., Steinke, T. (2016). Fast In-Memory Checkpointing with POSIX API for Legacy Exascale-Applications. In: Bungartz, HJ., Neumann, P., Nagel, W. (eds) Software for Exascale Computing - SPPEXA 2013-2015. Lecture Notes in Computational Science and Engineering, vol 113. Springer, Cham. https://doi.org/10.1007/978-3-319-40528-5_19

DOI: https://doi.org/10.1007/978-3-319-40528-5_19
Published: 15 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40526-1
Online ISBN: 978-3-319-40528-5
eBook Packages: Mathematics and Statistics (R0)