Abstract
This paper presents an implementation of several consistent recovery protocols at the abstract device level and compares their performance. We performed experiments using three NAS Parallel Benchmark applications with class C datasets on state-of-the-art equipment. An interesting result is that the causal message logging protocol has the highest recovery cost for communication-intensive applications, because it suffers from a concentrated overload of simultaneous message replaying. Receiver-based optimistic message logging has the lowest recovery cost, with the drawback of extensive disk access overhead in failure-free executions. Coordinated checkpointing appears to be the most practical choice among them.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Woo, N., Jung, H., Shin, D., Han, H., Yeom, H.Y., Park, T. (2005). Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF. In: Dal Cin, M., Kaâniche, M., Pataricza, A. (eds) Dependable Computing - EDCC 5. EDCC 2005. Lecture Notes in Computer Science, vol 3463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408901_12
DOI: https://doi.org/10.1007/11408901_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25723-3
Online ISBN: 978-3-540-32019-7
eBook Packages: Computer Science, Computer Science (R0)