Fault Tolerant Wide-Area Parallel Computing

Weissman, Jon B.

doi:10.1007/3-540-45591-4_168

Jon B. Weissman

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1800)

Included in the following conference series:

International Parallel and Distributed Processing Symposium

Abstract

Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and must not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but wide-area replication outperforms checkpoint-recovery for larger-grain problems, precisely the problems most suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.

This work was partially funded by grants NSF ACIR-9996418 and CDA-9633299, and AFOSR F49620-96-1-0472.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA
Jon B. Weissman

Editor information

Editors and Affiliations

Centre Universitaire d’Informatique, Université de Genève, 24, rue Général Dufour, CH-1211, Genève 4, Switzerland
José Rolim

About this paper

Cite this paper

Weissman, J.B. (2000). Fault Tolerant Wide-Area Parallel Computing. In: Rolim, J. (eds) Parallel and Distributed Processing. IPDPS 2000. Lecture Notes in Computer Science, vol 1800. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45591-4_168

DOI: https://doi.org/10.1007/3-540-45591-4_168
Published: 25 May 2000
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67442-9
Online ISBN: 978-3-540-45591-2
eBook Packages: Springer Book Archive
