Abstract
High-performance computing (HPC) systems are now far larger than ever before, which makes fault tolerance a growing challenge. MPI (Message Passing Interface) is one of the most important programming tools for HPC. Several fault-tolerant extensions for MPI exist, such as MPICH-V, StarFish, and FT-MPI, and most of them rely on on-disk checkpointing. In this paper, we apply two XOR-based double-erasure codes, RDP (Row-Diagonal Parity) and B-Code, to in-memory checkpointing for MPI programs. We develop scalable checkpointing and recovery algorithms that embed erasure-code encoding and decoding into MPI collective communication operations. Experiments show that the scalable algorithms reduce communication overhead and balance computation effectively. Our approach provides highly reliable, fast in-memory checkpointing for MPI programs.
This work was supported in part by the National High Technology Research and Development Program of China (2008AA01Z401), RFDP of China (20070055054), and the Science and Technology Development Plan of Tianjin (08JCYBJC13000).
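As a rough illustration of the XOR principle behind such codes, the following minimal sketch simulates in-memory parity checkpointing in plain Python. It is not the RDP or B-Code algorithm from the paper: it uses a single XOR parity block, so it tolerates only one lost process, whereas the double-erasure codes named above tolerate two. The function names (`xor_bytes`, `encode_parity`, `recover`) and the toy checkpoint data are assumptions made for illustration, and the list of buffers stands in for the per-process memory that an MPI program would combine with a collective operation such as a reduction using the bitwise-XOR operator.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints):
    """XOR all checkpoint buffers into one parity buffer.

    In a real MPI program this step would be a collective reduction
    over the processes; here the processes are simulated by a list.
    """
    return reduce(xor_bytes, checkpoints)


def recover(checkpoints, parity, lost):
    """Rebuild the buffer at index `lost` from the survivors and the parity."""
    survivors = [c for i, c in enumerate(checkpoints) if i != lost]
    return reduce(xor_bytes, survivors, parity)


# Hypothetical checkpoint data: four simulated processes, 8 bytes each.
checkpoints = [bytes([rank] * 8) for rank in range(1, 5)]
parity = encode_parity(checkpoints)

# Pretend process 2 failed and rebuild its checkpoint from the others.
recovered = recover(checkpoints, parity, lost=2)
assert recovered == checkpoints[2]
```

Because XOR is associative and commutative, the parity can be accumulated in any order, which is the property that lets the encoding be folded into a collective communication step rather than computed on a single node.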
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, G., Liu, X., Li, A., Zhang, F. (2009). In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes. In: Ropo, M., Westerholm, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2009. Lecture Notes in Computer Science, vol 5759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03770-2_15
DOI: https://doi.org/10.1007/978-3-642-03770-2_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03769-6
Online ISBN: 978-3-642-03770-2
eBook Packages: Computer Science, Computer Science (R0)