
Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3463)

Abstract

This paper presents an implementation of several consistent recovery protocols at the abstract device level and compares their performance. We performed experiments using three NAS Parallel Benchmark applications with class C datasets on state-of-the-art equipment. The interesting result is that the causal message logging protocol has the highest recovery cost for communication-intensive applications, since it suffers from a concentrated overload of simultaneous message replaying. Receiver-based optimistic message logging has the lowest recovery cost, with the drawback of extensive disk access overhead in failure-free executions. Coordinated checkpointing appears to be the most practical choice among them.




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Woo, N., Jung, H., Shin, D., Han, H., Yeom, H.Y., Park, T. (2005). Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF. In: Dal Cin, M., Kaâniche, M., Pataricza, A. (eds) Dependable Computing - EDCC 5. EDCC 2005. Lecture Notes in Computer Science, vol 3463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408901_12


  • DOI: https://doi.org/10.1007/11408901_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25723-3

  • Online ISBN: 978-3-540-32019-7

  • eBook Packages: Computer Science, Computer Science (R0)
