System-Level Resource Monitoring in High-Performance Computing Environments

Agarwala, Sandip; Poellabauer, Christian; Kong, Jiantao; Schwan, Karsten; Wolf, Matthew

doi:10.1023/B:GRID.0000035189.80518.5d

Published: September 2003

Volume 1, pages 273–289 (2003)

Journal of Grid Computing

Abstract

Low-overhead resource monitoring is key to the successful management of distributed high-performance computing environments, particularly when applications have well-defined quality of service (QoS) requirements. The dproc system-level monitoring mechanisms provide tools both for efficiently monitoring system-level events and for notifying remote hosts of events relevant to their operation. Implemented as an extension to the Linux kernel, dproc provides several key functions. First, it extends the familiar /proc virtual filesystem interface with resource information collected from both local and remote hosts. Second, to capture and distribute monitoring information predictably, dproc uses a kernel-level group communication facility, termed KECho, which implements events and event channels. Third, and the focus of this paper, dproc's resource monitoring can be customized at run time, which includes generating monitoring functionality and deploying it within remote operating system kernels. Using dproc, we show that (a) data streams can be customized according to a client's resource availability (dynamic stream management), (b) by dynamically varying distributed monitoring (dynamic filtering of monitoring information), an appropriate balance can be maintained between monitoring overheads and application quality, and (c) by performing monitoring at kernel level, the information captured enables decision making that takes into account the multiple resources used by applications.
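The idea of dynamically filtering monitoring information on an event channel can be illustrated with a small user-space sketch. It is not dproc or KECho code, both of which run inside the Linux kernel. The names `Sample`, `EventChannel`, and `change_filter` are hypothetical and introduced here only for illustration, and the change-threshold policy is an assumed example of a filter rather than one taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Sample:
    """One monitoring record (hypothetical layout, not dproc's format)."""
    host: str
    resource: str  # e.g. "cpu_load"
    value: float


Filter = Callable[[Sample], bool]


@dataclass
class EventChannel:
    """Toy in-process stand-in for an event channel (not the KECho API)."""
    subscribers: Dict[str, Filter] = field(default_factory=dict)
    delivered: Dict[str, List[Sample]] = field(default_factory=dict)

    def subscribe(self, name: str, flt: Filter) -> None:
        # Subscribing again under the same name swaps the filter at run time.
        self.subscribers[name] = flt
        self.delivered.setdefault(name, [])

    def publish(self, sample: Sample) -> int:
        """Deliver the sample to each subscriber whose filter accepts it."""
        sent = 0
        for name, flt in self.subscribers.items():
            if flt(sample):
                self.delivered[name].append(sample)
                sent += 1
        return sent


def change_filter(min_delta: float) -> Filter:
    """Accept a sample only if it differs from the last accepted value
    for the same (host, resource) pair by at least min_delta."""
    last: Dict[Tuple[str, str], float] = {}

    def accept(s: Sample) -> bool:
        key = (s.host, s.resource)
        prev = last.get(key)
        if prev is None or abs(s.value - prev) >= min_delta:
            last[key] = s.value
            return True
        return False

    return accept


if __name__ == "__main__":
    channel = EventChannel()
    channel.subscribe("client", change_filter(0.5))
    for load in [1.0, 1.1, 1.2, 1.8, 1.9, 0.9]:
        channel.publish(Sample("node1", "cpu_load", load))
    print([s.value for s in channel.delivered["client"]])
```

A larger `min_delta` sends fewer events and so lowers monitoring overhead, at the cost of a coarser view for the subscriber. Replacing a subscriber's filter while the channel is in use is the user-space analogue of the balance the abstract describes between overhead and application quality.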

Author information

Authors and Affiliations

College of Computing, Georgia Institute of Technology, Atlanta, GA, 30332, USA E-mail: sandip@cc.gatech.edu
Sandip Agarwala, Christian Poellabauer, Jiantao Kong, Karsten Schwan & Matthew Wolf

About this article

Cite this article

Agarwala, S., Poellabauer, C., Kong, J. et al. System-Level Resource Monitoring in High-Performance Computing Environments. Journal of Grid Computing 1, 273–289 (2003). https://doi.org/10.1023/B:GRID.0000035189.80518.5d

Issue Date: September 2003
DOI: https://doi.org/10.1023/B:GRID.0000035189.80518.5d
