An Infrastructure for Monitoring and Management in Computational Grids

Waheed, Abdul; Smith, Warren; George, Jude; Yan, Jerry

doi:10.1007/3-540-40889-4_18

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1915)

Abstract

We present the design and implementation of an infrastructure that enables monitoring of resources, services, and applications in a computational grid and provides a toolkit to help manage these entities when faults occur. This infrastructure builds on three basic monitoring components: sensors to perform measurements, actuators to perform actions, and an event service to communicate events between remote processes. We describe how we apply our infrastructure to support a grid service and an application: (1) the Globus Metacomputing Directory Service; and (2) a long-running, coarse-grained parameter study application. We use these applications to show that our monitoring infrastructure is highly modular, conveniently retargetable, and extensible.
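The abstract names three components: sensors, actuators, and an event service. The minimal Python sketch below shows how these components might fit together. It is not the paper's implementation or API. The class names (`Event`, `EventService`, `Sensor`, `Actuator`), the `make_fault_manager` threshold policy, and the `"directory-server"` entity are all hypothetical. The event service is an in-process publish/subscribe stand-in, whereas the infrastructure described here communicates events between remote processes.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List
import time


@dataclass
class Event:
    """A monitoring event: which entity, what kind of measurement, its value."""
    source: str   # hypothetical name of the monitored entity
    kind: str     # event type used to match subscribers
    value: float
    timestamp: float = field(default_factory=time.time)


class EventService:
    """In-process publish/subscribe stand-in for a remote event service.

    Assumption: a real grid event service would deliver events across
    processes and hosts; here handlers are called synchronously.
    """

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, kind: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[kind].append(handler)

    def publish(self, event: Event) -> None:
        for handler in self._subscribers[event.kind]:
            handler(event)


class Sensor:
    """Performs a measurement and publishes the result as an event."""

    def __init__(self, source: str, kind: str, probe: Callable[[], float],
                 events: EventService) -> None:
        self.source, self.kind, self.probe, self.events = source, kind, probe, events

    def sample(self) -> Event:
        event = Event(self.source, self.kind, self.probe())
        self.events.publish(event)
        return event


class Actuator:
    """Performs an action on a monitored entity."""

    def __init__(self, name: str, action: Callable[[str], None]) -> None:
        self.name, self.action = name, action

    def act(self, target: str) -> None:
        self.action(target)


def make_fault_manager(threshold: float, actuator: Actuator) -> Callable[[Event], None]:
    """Hypothetical management policy: act when a measurement exceeds a threshold."""
    def on_event(event: Event) -> None:
        if event.value > threshold:
            actuator.act(event.source)
    return on_event


# Usage: a response-time sensor triggers a (simulated) restart actuator.
restarted: List[str] = []
events = EventService()
restart = Actuator("restart", restarted.append)
events.subscribe("response_time", make_fault_manager(2.0, restart))

readings = iter([0.5, 3.1])
sensor = Sensor("directory-server", "response_time", lambda: next(readings), events)
sensor.sample()   # 0.5 is below the threshold: no action
sensor.sample()   # 3.1 exceeds the threshold: the actuator fires
print(restarted)  # ['directory-server']
```

In this sketch the sensor never refers to the actuator directly. Because the two are decoupled through events, a management policy can be replaced or retargeted without changing the measurement code. This is one plausible reading of the modularity the abstract claims.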

Author information

Authors and Affiliations

MRJ Technology Solutions, NASA Ames Research Center, Moffett Field, CA, 94035-1000
Abdul Waheed
Computer Sciences Corp., NASA Ames Research Center, Moffett Field, CA, 94035-1000
Warren Smith
FSC End2End, Inc., NASA Ames Research Center, Moffett Field, CA, 94035-1000
Jude George
NASA Ames Research Center, Moffett Field, CA, 94035-1000
Jerry Yan

Editor information

Editors and Affiliations

Department of Computer Science, University of Rochester, Rochester, NY, 14627-0226, USA
Sandhya Dwarkadas

About this paper

Cite this paper

Waheed, A., Smith, W., George, J., Yan, J. (2000). An Infrastructure for Monitoring and Management in Computational Grids. In: Dwarkadas, S. (eds) Languages, Compilers, and Run-Time Systems for Scalable Computers. LCR 2000. Lecture Notes in Computer Science, vol 1915. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40889-4_18

DOI: https://doi.org/10.1007/3-540-40889-4_18
Published: 26 July 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41185-7
Online ISBN: 978-3-540-40889-5
eBook Packages: Springer Book Archive