An Infrastructure for Monitoring and Management in Computational Grids

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1915)

Abstract

We present the design and implementation of an infrastructure that enables monitoring of resources, services, and applications in a computational grid and provides a toolkit to help manage these entities when faults occur. This infrastructure builds on three basic monitoring components: sensors to perform measurements, actuators to perform actions, and an event service to communicate events between remote processes. We describe how we apply our infrastructure to support a grid service and an application: (1) the Globus Metacomputing Directory Service; and (2) a long-running and coarse-grained parameter study application. We use these applications to show that our monitoring infrastructure is highly modular, conveniently retargetable, and extensible.
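
The three components named in the abstract can be illustrated with a minimal in-process sketch. This is not the paper's implementation or API: the class names (`Event`, `EventService`, `Sensor`, `Actuator`), the helper `trigger_above`, and the threshold rule are all hypothetical, and the event service here is a local publish/subscribe object standing in for communication between remote processes.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """A single measurement, tagged with the name of the sensor that produced it."""
    name: str
    value: float
    timestamp: float = field(default_factory=time.time)


class EventService:
    """Hypothetical in-process stand-in for an event service (local pub/sub only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = {}

    def subscribe(self, name: str, handler: Callable[[Event], None]) -> None:
        self._subscribers.setdefault(name, []).append(handler)

    def publish(self, event: Event) -> None:
        # Deliver the event to every handler registered for this event name.
        for handler in self._subscribers.get(event.name, []):
            handler(event)


class Sensor:
    """Performs a measurement and publishes the result as an event."""

    def __init__(self, name: str, measure: Callable[[], float], service: EventService) -> None:
        self.name = name
        self.measure = measure
        self.service = service

    def sample(self) -> Event:
        event = Event(self.name, self.measure())
        self.service.publish(event)
        return event


class Actuator:
    """Performs an action in response to an event and records each invocation."""

    def __init__(self, name: str, action: Callable[[Event], None]) -> None:
        self.name = name
        self.action = action
        self.invocations: List[Event] = []

    def act(self, event: Event) -> None:
        self.invocations.append(event)
        self.action(event)


def trigger_above(threshold: float, actuator: Actuator) -> Callable[[Event], None]:
    """Assumed management rule: fire the actuator when a reading exceeds the threshold."""
    def handler(event: Event) -> None:
        if event.value > threshold:
            actuator.act(event)
    return handler
```

In this sketch a management rule is just a subscriber: registering `trigger_above(0.9, actuator)` for a sensor's event name connects measurement to action through the event service, so sensors and actuators never reference each other directly.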

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Waheed, A., Smith, W., George, J., Yan, J. (2000). An Infrastructure for Monitoring and Management in Computational Grids. In: Dwarkadas, S. (eds) Languages, Compilers, and Run-Time Systems for Scalable Computers. LCR 2000. Lecture Notes in Computer Science, vol 1915. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40889-4_18

  • DOI: https://doi.org/10.1007/3-540-40889-4_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41185-7

  • Online ISBN: 978-3-540-40889-5

  • eBook Packages: Springer Book Archive
