Fault Tolerant Wide-Area Parallel Computing

  • Conference paper
Parallel and Distributed Processing (IPDPS 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1800)

Abstract

Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and must not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but that wide-area replication outperforms checkpoint-recovery for larger-grain problems, which are precisely the problems most suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.

This work was partially funded by grants NSF ACIR-9996418 and CDA-9633299, and AFOSR F49620-96-1-0472.

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Weissman, J.B. (2000). Fault Tolerant Wide-Area Parallel Computing. In: Rolim, J. (eds) Parallel and Distributed Processing. IPDPS 2000. Lecture Notes in Computer Science, vol 1800. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45591-4_168

  • DOI: https://doi.org/10.1007/3-540-45591-4_168

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67442-9

  • Online ISBN: 978-3-540-45591-2

  • eBook Packages: Springer Book Archive
