A fault detection service for wide area distributed computations

Cluster Computing

Abstract

The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and serving as part of the NetSolve network-enabled numerical solver.
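
The trade-off between timeliness of reporting and false positive rate can be illustrated with a small sketch. The abstract does not state which detection mechanism the service uses, so the heartbeat-and-timeout scheme below is an assumption chosen as a common form of unreliable fault detector, and every name in it (HeartbeatDetector, suspicion_log, ARRIVALS, CHECKS) is hypothetical rather than taken from the paper. Time is passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class HeartbeatDetector:
    """Timeout-based unreliable fault detector (illustrative assumption,
    not the service described in the paper)."""

    timeout: float  # seconds of silence tolerated before suspecting a component
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from a component.
        self.last_seen[component] = now

    def suspects(self, now: float) -> List[str]:
        # A component is suspected once it has been silent longer than the
        # timeout. The suspicion can be wrong: the heartbeat may only be late.
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout)


def suspicion_log(timeout: float,
                  arrivals: List[Tuple[str, float]],
                  checks: List[float]) -> List[Tuple[float, List[str]]]:
    """Replay heartbeat arrivals and status checks in time order."""
    detector = HeartbeatDetector(timeout)
    # kind 0 = heartbeat arrival, kind 1 = status check; arrivals sort first
    # when both happen at the same instant.
    events = [(t, 0, name) for name, t in arrivals]
    events += [(t, 1, "") for t in checks]
    log = []
    for t, kind, name in sorted(events, key=lambda e: (e[0], e[1])):
        if kind == 0:
            detector.heartbeat(name, t)
        else:
            log.append((t, detector.suspects(t)))
    return log


# Hypothetical trace: "worker" nominally beats once per second, one heartbeat
# is delayed until t=3.5, and the component stops for good after t=4.0.
ARRIVALS = [("worker", 0.0), ("worker", 1.0), ("worker", 3.5), ("worker", 4.0)]
CHECKS = [3.0, 8.0, 10.0]

if __name__ == "__main__":
    for timeout in (1.5, 5.0):
        print(timeout, suspicion_log(timeout, ARRIVALS, CHECKS))
```

In this trace the 1.5-second timeout already reports the real failure at the t=8.0 check, but it also raises a false suspicion at t=3.0, when the heartbeat was merely delayed. The 5.0-second timeout never suspects the component wrongly, yet it does not report the failure until the t=10.0 check.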

Cite this article

Stelling, P., DeMatteis, C., Foster, I. et al. A fault detection service for wide area distributed computations. Cluster Computing 2, 117–128 (1999). https://doi.org/10.1023/A:1019070407281

  • DOI: https://doi.org/10.1023/A:1019070407281
