System-Level Resource Monitoring in High-Performance Computing Environments

Journal of Grid Computing

Abstract

Low-overhead resource monitoring is key to the successful management of distributed high-performance computing environments, particularly when applications have well-defined quality of service (QoS) requirements. The dproc system-level monitoring mechanisms provide tools both for efficiently monitoring system-level events and for notifying remote hosts of events relevant to their operation. Implemented as an extension to the Linux kernel, dproc provides several key functions. First, it extends the familiar /proc virtual filesystem with resource information collected from both local and remote hosts. Second, to predictably capture and distribute monitoring information, dproc uses a kernel-level group communication facility, termed KECho, which implements events and event channels. Third, and the focus of this paper, dproc's resource monitoring is customizable at run time, which includes generating and deploying monitoring functionality within remote operating system kernels. Using dproc, we show that (a) data streams can be customized according to a client's available resources (dynamic stream management), (b) by dynamically varying distributed monitoring (dynamic filtering of monitoring information), an appropriate balance can be maintained between monitoring overheads and application quality, and (c) because monitoring is performed at the kernel level, the information captured enables decision making that takes into account the multiple resources used by applications.
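
The idea of dynamically filtering monitoring information on an event channel can be illustrated with a small user-level sketch. It is not dproc or KECho code: the names `Sample`, `DeltaFilter`, and `EventChannel` are hypothetical, and the real mechanisms run inside the Linux kernel. The sketch only assumes that a subscriber attaches a filter to a channel and can replace that filter at run time to trade monitoring overhead against information quality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Sample:
    """One monitoring event (hypothetical record layout)."""
    host: str
    metric: str
    value: float


class DeltaFilter:
    """Pass a sample only if it differs from the last forwarded value
    for the same (host, metric) by at least `delta`."""

    def __init__(self, delta: float) -> None:
        self.delta = delta
        self._last: Dict[Tuple[str, str], float] = {}

    def __call__(self, sample: Sample) -> bool:
        key = (sample.host, sample.metric)
        last = self._last.get(key)
        if last is None or abs(sample.value - last) >= self.delta:
            self._last[key] = sample.value
            return True
        return False


Filter = Optional[Callable[[Sample], bool]]


class EventChannel:
    """Toy in-process event channel with one filter per subscriber."""

    def __init__(self) -> None:
        self._subs: List[list] = []  # each entry is [handler, filter]

    def subscribe(self, handler: Callable[[Sample], None], flt: Filter = None) -> int:
        self._subs.append([handler, flt])
        return len(self._subs) - 1

    def set_filter(self, sub_id: int, flt: Filter) -> None:
        # Replace a subscriber's filter while the channel is in use.
        self._subs[sub_id][1] = flt

    def submit(self, sample: Sample) -> int:
        # Deliver to every subscriber whose filter accepts the sample;
        # return the number of deliveries.
        delivered = 0
        for handler, flt in self._subs:
            if flt is None or flt(sample):
                handler(sample)
                delivered += 1
        return delivered


received: List[Sample] = []
channel = EventChannel()
sub = channel.subscribe(received.append, DeltaFilter(delta=0.5))
for load in (1.0, 1.1, 1.2, 1.8, 1.9, 3.0):
    channel.submit(Sample("node1", "loadavg", load))
print([s.value for s in received])  # [1.0, 1.8, 3.0]
```

With a threshold of 0.5, only three of the six samples reach the subscriber. Calling `channel.set_filter(sub, DeltaFilter(delta=0.0))` would forward every later sample, which shows the trade-off between overhead and detail in miniature.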

About this article

Cite this article

Agarwala, S., Poellabauer, C., Kong, J. et al. System-Level Resource Monitoring in High-Performance Computing Environments. Journal of Grid Computing 1, 273–289 (2003). https://doi.org/10.1023/B:GRID.0000035189.80518.5d


  • DOI: https://doi.org/10.1023/B:GRID.0000035189.80518.5d
