Abstract
One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. To this end, data centers deploy monitoring systems that capture the behavior of the executed jobs. While it is easy to use statistics to rank jobs by their utilization of compute, storage, and network resources, it is difficult to find patterns among 100,000 jobs, e.g., to determine whether there is a class of jobs that does not perform well. Similarly, when support staff investigate a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify jobs related to such a blueprint. This allows staff to better understand the usage of the exhibited behavior and to assess the optimization potential.
In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we sketch a methodology that uses temporal I/O similarity to identify jobs related to the reference job. In practice, we apply several previously developed time series algorithms. We conduct a study to explore the effectiveness of the approach by investigating jobs related to a reference job. The data stem from DKRZ’s supercomputer Mistral and include more than 500,000 jobs executed during more than 6 months of operation. Our analysis shows that the strategy and algorithms have the potential to identify similar jobs, but more testing is necessary.
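The idea of ranking jobs by temporal I/O similarity to a reference job can be illustrated with a minimal sketch. The abstract does not name the time series algorithms that were applied, so the zero-padded Euclidean distance below is only a stand-in, and all job names, values, and function names (`distance`, `similarity`, `rank_similar`) are hypothetical.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two per-segment series.

    Assumption: the shorter series is padded with zeros so that jobs of
    different runtime can be compared. This is a stand-in, not one of the
    algorithms evaluated in the paper.
    """
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map a distance to the range (0, 1]; 1.0 means identical series."""
    return 1.0 / (1.0 + distance(a, b))


def rank_similar(reference, jobs, top_k=3):
    """Return up to top_k (job_id, similarity) pairs, most similar first."""
    scored = [(job_id, similarity(reference, series))
              for job_id, series in jobs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical per-segment I/O activity (arbitrary units), one list per job.
jobs = {
    "job_a": [0.0, 4.0, 4.0, 0.0],
    "job_b": [0.0, 4.0, 3.0, 0.0],
    "job_c": [9.0, 9.0, 9.0, 9.0, 9.0],
}
reference = [0.0, 4.0, 4.0, 0.0]
ranking = rank_similar(reference, jobs)
```

Any distance measure for time series could be swapped in for `distance` without changing the ranking step; which measure works best is the kind of question the study explores.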
Notes
1. We found in preliminary experiments that 10 min reduces compute time and noise, i.e., the variation of the statistics when re-running the same job.
2. This can be a member of the support staff or the data center user who executed the job.
3. The reason is that a few write calls transfer many bytes; because the number of write calls is below our 90% quantile, write calls are set to 0.
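One possible reading of notes 1 and 3 is that raw monitoring samples are averaged over fixed time windows and that values below a 90% quantile are then treated as zero. The sketch below follows that reading only. The sampling rate (one sample per minute), the nearest-rank quantile definition, the sample values, and all function names are assumptions made for illustration and are not taken from the paper.

```python
def segment_means(samples, samples_per_segment):
    """Average consecutive samples into fixed-length segments.

    The last segment may contain fewer samples than the others.
    """
    chunks = (samples[i:i + samples_per_segment]
              for i in range(0, len(samples), samples_per_segment))
    return [sum(chunk) / len(chunk) for chunk in chunks]


def quantile(values, percent):
    """Nearest-rank quantile of a non-empty list, with 0 < percent <= 100."""
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-percent * len(ordered) // 100))
    return ordered[rank - 1]


def suppress_below(values, threshold):
    """Set every value below the threshold to 0.0 and keep the rest."""
    return [v if v >= threshold else 0.0 for v in values]


# Hypothetical write-call counts, assumed to be sampled once per minute.
samples = [1] * 10 + [3] * 10 + [100] * 5
segments = segment_means(samples, samples_per_segment=10)  # 10 min windows

# Hypothetical population from which the 90% quantile is derived.
population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
threshold = quantile(population, 90)
filtered = suppress_below(segments, threshold)
```

In this toy data, only the final burst of write calls survives the threshold, which mirrors how a metric with few calls can end up as 0 after filtering.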
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kunkel, J., Betke, E. (2021). Toward a Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science(), vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_10
DOI: https://doi.org/10.1007/978-3-030-90539-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90538-5
Online ISBN: 978-3-030-90539-2
eBook Packages: Computer Science (R0)