Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF

Woo, Namyoon; Jung, Hyungsoo; Shin, Dongin; Han, Hyuck; Yeom, Heon Y.; Park, Taesoon

doi:10.1007/11408901_12

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3463)

Abstract

This paper presents an implementation of several consistent recovery protocols at the abstract device level and compares their performance. We performed experiments using three NAS Parallel Benchmark applications with class C datasets on state-of-the-art equipment. The interesting result is that the causal message logging protocol has the most expensive recovery cost for communication-intensive applications, since it suffers from the concentrated overload of simultaneous message replaying. Receiver-based optimistic message logging has the lowest recovery cost, with the drawback of extensive disk access overhead in failure-free executions. Coordinated checkpointing seems the most practical choice among them.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea
Namyoon Woo, Hyungsoo Jung, Dongin Shin, Hyuck Han & Heon Y. Yeom
Department of Computer Engineering, Sejong University, Seoul, 143-747, Korea
Taesoon Park

Editor information

Editors and Affiliations

Institute for Computer Sciences III, University of Erlangen-Nürnberg, Martensstr. 3, 91058, Erlangen, Germany
Mario Dal Cin
UPS, INSA, INP, ISAE; LAAS-CNRS, Université de Toulouse, Toulouse, France
Mohamed Kaâniche
Department of Measurement and Information Systems, Budapest University of Technology and Economics,
András Pataricza

About this paper

Cite this paper

Woo, N., Jung, H., Shin, D., Han, H., Yeom, H.Y., Park, T. (2005). Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF. In: Dal Cin, M., Kaâniche, M., Pataricza, A. (eds) Dependable Computing - EDCC 5. EDCC 2005. Lecture Notes in Computer Science, vol 3463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408901_12

DOI: https://doi.org/10.1007/11408901_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25723-3
Online ISBN: 978-3-540-32019-7
eBook Packages: Computer Science, Computer Science (R0)
