A fault detection service for wide area distributed computations

Stelling, Paul; DeMatteis, Cheryl; Foster, Ian; Kesselman, Carl; Lee, Craig; von Laszewski, Gregor

doi:10.1023/A:1019070407281

Published: September 1999

Cluster Computing, volume 2, pages 117–128 (1999)

Paul Stelling¹,
Cheryl DeMatteis¹,
Ian Foster²,
Carl Kesselman³,
Craig Lee¹ &
Gregor von Laszewski²

Abstract

The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and serving as part of the NetSolve network-enabled numerical solver.
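The trade-off between timeliness and false positives can be illustrated with a minimal sketch. It assumes a heartbeat-and-timeout scheme, which is one common way to build an unreliable failure detector; the abstract does not specify the mechanism, so the class `HeartbeatDetector` and its parameters `interval` and `missed_allowed` are hypothetical names chosen for illustration, not the service's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatDetector:
    """Hypothetical detector: suspects a component after a silent period."""
    interval: float        # assumed seconds between expected heartbeats
    missed_allowed: int    # silent intervals tolerated before suspicion
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from a component.
        self.last_seen[component] = now

    def suspected(self, now: float) -> List[str]:
        # A component is suspected (not proven failed) once it has been
        # silent for longer than interval * missed_allowed seconds.
        timeout = self.interval * self.missed_allowed
        return sorted(c for c, t in self.last_seen.items() if now - t > timeout)


# An eager detector reports quickly but risks false positives;
# a patient one reports later but is wrong less often.
eager = HeartbeatDetector(interval=1.0, missed_allowed=3)
patient = HeartbeatDetector(interval=1.0, missed_allowed=5)
for d in (eager, patient):
    d.heartbeat("a", 0.0)
    d.heartbeat("b", 0.0)
    d.heartbeat("a", 2.5)

early_eager = eager.suspected(3.5)      # "b" silent for 3.5 s > 3.0 s
early_patient = patient.suspected(3.5)  # 3.5 s is still within 5.0 s

# A late heartbeat from "b" retracts the suspicion: the detector is unreliable.
eager.heartbeat("b", 4.0)
after_late_heartbeat = eager.suspected(4.5)
```

Raising `missed_allowed` lowers the chance of suspecting a slow but live component, at the cost of a longer delay before a real failure is reported.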

Author information

Authors and Affiliations

The Aerospace Corporation, El Segundo, CA, 90245-4691, USA
Paul Stelling, Cheryl DeMatteis & Craig Lee
Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL, 60439, USA
Ian Foster & Gregor von Laszewski
Information Sciences Institute, University of Southern California, Marina del Rey, CA, 90292, USA
Carl Kesselman

About this article

Cite this article

Stelling, P., DeMatteis, C., Foster, I. et al. A fault detection service for wide area distributed computations. Cluster Computing 2, 117–128 (1999). https://doi.org/10.1023/A:1019070407281

Issue Date: September 1999
DOI: https://doi.org/10.1023/A:1019070407281
