Toward a Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis

Kunkel, Julian; Betke, Eugen

doi:10.1007/978-3-030-90539-2_10


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12761)

Included in the following conference series:

International Conference on High Performance Computing


Abstract

One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs. While it is easy to utilize statistics to rank jobs based on their utilization of computing, storage, and network resources, it is tricky to find patterns in 100,000 jobs, e.g., to determine whether there is a class of jobs that are not performing well. Similarly, when support staff investigate a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify jobs related to such a blueprint. This allows staff to better understand the usage of the exhibited behavior and to assess the optimization potential.

In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we sketch a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. Practically, we apply several previously developed time series algorithms. A study is conducted to explore the effectiveness of the approach by investigating related jobs for a reference job. The data stem from DKRZ’s supercomputer Mistral and include more than 500,000 jobs that were executed over more than 6 months of operation. Our analysis shows that the strategy and algorithms have the potential to identify similar jobs, but more testing is necessary.
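The general idea of ranking jobs by temporal I/O similarity to a reference job can be illustrated with a minimal sketch. The abstract does not name the algorithms used, so everything below is an assumption made for illustration: each job is represented as a sequence of integer codes (one code per time segment, e.g. 0 for "no I/O"), similarity is derived from a plain edit distance, and the names `levenshtein`, `similarity`, `rank_similar_jobs`, and `min_similarity` are hypothetical. This is not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Edit distance between two coded segment sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # drop a segment from a
                            curr[j - 1] + 1,      # add a segment to a
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rank_similar_jobs(reference: Sequence[int],
                      jobs: Dict[str, Sequence[int]],
                      min_similarity: float = 0.5) -> List[Tuple[str, float]]:
    """Return (job_id, similarity) pairs above the threshold, most similar first."""
    scored = [(job_id, similarity(reference, segs)) for job_id, segs in jobs.items()]
    hits = [(job_id, s) for job_id, s in scored if s >= min_similarity]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical coded jobs; the values are illustrative, not measured data.
    reference_job = [0, 0, 2, 2, 0]
    candidate_jobs = {
        "job_a": [0, 0, 2, 2, 0],
        "job_b": [0, 2, 2, 0],
        "job_c": [1, 1, 1, 1, 1, 1],
    }
    for job_id, score in rank_similar_jobs(reference_job, candidate_jobs):
        print(job_id, round(score, 2))
```

Normalizing by the longer sequence keeps scores comparable between jobs of different runtime. A distance measure designed for multivariate time series could be substituted in `similarity` without changing the ranking step.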


Notes

1. We found in preliminary experiments that 10 min reduces compute time and noise, i.e., the variation of the statistics when re-running the same job.
2. This can be support staff or a data center user who executed the job.
3. The reason is that a few write calls transfer many bytes; this is less than our 90%-quantile and, therefore, write calls will be set to 0.


Author information

Authors and Affiliations

Georg-August-Universität Göttingen/GWDG, Göttingen, Germany
Julian Kunkel
ECMWF, Reading, UK
Eugen Betke


Corresponding author

Correspondence to Julian Kunkel.

Editor information

Editors and Affiliations

University of Tennessee at Knoxville, Knoxville, TN, USA
Heike Jagode
Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
Hartwig Anzt
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Hatem Ltaief
University of Tennessee System, Knoxville, TN, USA
Piotr Luszczek


Cite this paper

Kunkel, J., Betke, E. (2021). Toward a Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_10


DOI: https://doi.org/10.1007/978-3-030-90539-2_10
Published: 13 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90538-5
Online ISBN: 978-3-030-90539-2
eBook Packages: Computer Science, Computer Science (R0)
