Abstract
We present the design and implementation of an infrastructure that enables monitoring of resources, services, and applications in a computational grid and provides a toolkit to help manage these entities when faults occur. This infrastructure builds on three basic monitoring components: sensors to perform measurements, actuators to perform actions, and an event service to communicate events between remote processes. We describe how we apply our infrastructure to support a grid service and an application: (1) the Globus Metacomputing Directory Service; and (2) a long-running and coarse-grained parameter study application. We use these applications to show that our monitoring infrastructure is highly modular, conveniently retargetable, and extensible.
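To make the division of labor among the three components concrete, the following is a minimal in-process sketch, not the infrastructure described in the paper. All class, method, and event names (EventService, ThresholdSensor, Actuator, "threshold_exceeded") are hypothetical, and the event service here dispatches within a single process, whereas the paper's event service communicates between remote processes.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventService:
    """Hypothetical in-process stand-in: routes named events to subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Deliver the event to every handler registered for this type.
        for handler in list(self._subscribers[event_type]):
            handler(payload)


class ThresholdSensor:
    """Performs a measurement and publishes an event when it exceeds a limit."""

    def __init__(self, name: str, measure: Callable[[], float],
                 limit: float, events: EventService) -> None:
        self.name = name
        self.measure = measure
        self.limit = limit
        self.events = events

    def poll(self) -> float:
        value = self.measure()
        if value > self.limit:
            self.events.publish("threshold_exceeded",
                                {"sensor": self.name, "value": value})
        return value


class Actuator:
    """Performs an action each time it receives an event."""

    def __init__(self, action: Callable[[dict], None]) -> None:
        self.action = action

    def __call__(self, payload: dict) -> None:
        self.action(payload)


# Usage: a sensor reports a high load, and an actuator reacts to the event.
events = EventService()
log: List[str] = []
restart = Actuator(lambda p: log.append(f"restart requested by {p['sensor']}"))
events.subscribe("threshold_exceeded", restart)

sensor = ThresholdSensor("cpu_load", measure=lambda: 0.95, limit=0.9, events=events)
sensor.poll()
```

In this sketch the sensor and the actuator never reference each other; they are coupled only through event names, which is one simple way to obtain the kind of modularity the abstract claims.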
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Waheed, A., Smith, W., George, J., Yan, J. (2000). An Infrastructure for Monitoring and Management in Computational Grids. In: Dwarkadas, S. (eds) Languages, Compilers, and Run-Time Systems for Scalable Computers. LCR 2000. Lecture Notes in Computer Science, vol 1915. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40889-4_18
DOI: https://doi.org/10.1007/3-540-40889-4_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41185-7
Online ISBN: 978-3-540-40889-5
eBook Packages: Springer Book Archive