Fault Detection Service Architecture for Grid Computing Systems

  • Conference paper
Computational Science and Its Applications – ICCSA 2004 (ICCSA 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3044)

Abstract

The ability to tolerate failures while effectively exploiting grid computing resources in a scalable and transparent manner must be an integral part of grid computing infrastructure. Hence, a fault-detection service is a necessary prerequisite for fault tolerance and fault recovery in grid computing. To this end, we present a scalable fault-detection service architecture. The proposed fault-detection system provides services that monitor user applications, grid middleware, and the dynamically changing state of a collection of distributed resources. It reports summaries of this information to the appropriate agents on demand or instantaneously in the event of failures.
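
The abstract does not say how monitoring and reporting are implemented, so the following is only a minimal sketch of the two reporting modes it names (summaries on demand, and immediate notification on failure). It assumes a heartbeat-with-timeout detector, which is a common technique but not necessarily the one used in the paper. The class `FaultDetector`, its methods, and the entity names are all hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class FaultDetector:
    """Hypothetical heartbeat-based monitor (an assumption, not the paper's design)."""

    timeout: float  # seconds of silence after which an entity is reported as failed
    last_seen: Dict[str, float] = field(default_factory=dict)
    failed: Set[str] = field(default_factory=set)
    subscribers: List[Callable[[str], None]] = field(default_factory=list)

    def subscribe(self, callback: Callable[[str], None]) -> None:
        # Register an agent to be notified as soon as a failure is detected.
        self.subscribers.append(callback)

    def heartbeat(self, entity: str, now: float) -> None:
        # Record a sign of life from an application, middleware service, or resource.
        self.last_seen[entity] = now
        self.failed.discard(entity)

    def check(self, now: float) -> None:
        # Mark entities that have been silent for longer than the timeout
        # and push one notification per newly detected failure.
        for entity, seen in self.last_seen.items():
            if entity not in self.failed and now - seen > self.timeout:
                self.failed.add(entity)
                for notify in self.subscribers:
                    notify(entity)

    def summary(self) -> Dict[str, str]:
        # On-demand report of the current state of every monitored entity.
        return {
            entity: "failed" if entity in self.failed else "alive"
            for entity in self.last_seen
        }


# Example usage with made-up entity names.
events: List[str] = []
detector = FaultDetector(timeout=5.0)
detector.subscribe(events.append)
detector.heartbeat("app-1", now=0.0)
detector.heartbeat("node-7", now=0.0)
detector.heartbeat("app-1", now=4.0)
detector.check(now=6.0)
print(events)
print(detector.summary())
```

In this sketch a subscriber is notified once per detected failure, while `summary()` can be polled at any time. A real grid deployment would also have to address scalability (for example, by organizing detectors hierarchically), which this example does not attempt.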

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Abawajy, J.H. (2004). Fault Detection Service Architecture for Grid Computing Systems. In: Laganá, A., Gavrilova, M.L., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O. (eds) Computational Science and Its Applications – ICCSA 2004. ICCSA 2004. Lecture Notes in Computer Science, vol 3044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24709-8_12

  • DOI: https://doi.org/10.1007/978-3-540-24709-8_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22056-5

  • Online ISBN: 978-3-540-24709-8

  • eBook Packages: Springer Book Archive
