Abstract
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and must not compromise the high-performance objective of parallel processing. In this paper, we explore two options for achieving fault tolerance for a common class of parallel applications, single-program-multiple-data (SPMD). We quantitatively compare checkpoint-recovery and wide-area replication as means of achieving fault tolerance. The experimental results obtained for a canonical SPMD application suggest that checkpoint-recovery may be preferable for small problems if local parallel disks are available, but that wide-area replication outperforms checkpoint-recovery for larger-grain problems, which are precisely the problems best suited to the wide-area network environment. The results also show that it is possible to accurately model and predict the overheads of the two methods.
This work was partially funded by grants NSF ACIR-9996418 and CDA-9633299, and AFOSR F49620-96-1-0472.
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Weissman, J.B. (2000). Fault Tolerant Wide-Area Parallel Computing. In: Rolim, J. (eds) Parallel and Distributed Processing. IPDPS 2000. Lecture Notes in Computer Science, vol 1800. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45591-4_168
DOI: https://doi.org/10.1007/3-540-45591-4_168
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67442-9
Online ISBN: 978-3-540-45591-2
eBook Packages: Springer Book Archive