ClusterCockpit — A web application for job-specific performance monitoring | IEEE Conference Publication | IEEE Xplore

Abstract:

Monitoring is a common component of HPC system software. Up to now, monitoring has focused mainly on health checking and system-level performance as well as on job scheduler information, and has been targeted towards system administrators. Recently, job-specific performance monitoring based on hardware performance counter metrics has gained attention at academic HPC centers. HPC is becoming a mainstream tool that is also used by non-HPC experts, and HPC centers see a demand to check for pathological jobs and jobs with large optimization potential. The possibility to measure hardware performance counter data with negligible overhead allows assessment of efficient resource utilization and detection of pathological jobs. Pathological jobs include, for example, jobs with errors in the batch script, jobs that do not terminate, jobs with severe load imbalance, and jobs that do not use any resources. This paper introduces ClusterCockpit, a web front end tailor-made for job-specific performance monitoring. While many recent job-specific performance monitoring efforts concentrate on the measurement and data collection layers, ClusterCockpit provides a modern user interface targeted towards performance analysts as well as application users.
Date of Conference: 23-26 September 2019
Date Added to IEEE Xplore: 07 November 2019
Conference Location: Albuquerque, NM, USA
