Toward a Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis

  • Conference paper
High Performance Computing (ISC High Performance 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12761)

Abstract

One goal of support staff at a data center is to identify inefficient jobs and to improve their efficiency. Therefore, a data center deploys monitoring systems that capture the behavior of the executed jobs. While it is easy to utilize statistics to rank jobs based on the utilization of computing, storage, and network, it is tricky to find patterns in 100,000 jobs, e.g., to determine whether there is a class of jobs that is not performing well. Similarly, when support staff investigates a specific job in detail, e.g., because it is inefficient or highly efficient, it is relevant to identify jobs related to such a blueprint. This allows staff to better understand the usage of the exhibited behavior and to assess the optimization potential.

In this article, our goal is to identify jobs similar to an arbitrary reference job. In particular, we sketch a methodology that utilizes temporal I/O similarity to identify jobs related to the reference job. Practically, we apply several previously developed time series algorithms. A study is conducted to explore the effectiveness of the approach by investigating related jobs for a reference job. The data stem from DKRZ’s supercomputer Mistral and include more than 500,000 jobs that were executed during more than 6 months of operation. Our analysis shows that the strategy and algorithms bear the potential to identify similar jobs, but more testing is necessary.
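
As an illustration of the general idea, and not of the algorithms evaluated in this paper, the following minimal sketch ranks jobs by their similarity to a reference job. It assumes that each job's I/O activity is available as a short numeric time series. The Euclidean distance with zero-padding, the mapping from distance to similarity, and all function and job names are hypothetical choices made for this example.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two series; the shorter one is zero-padded."""
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map a distance to the range (0, 1]; 1.0 means identical series."""
    return 1.0 / (1.0 + distance(a, b))


def rank_similar_jobs(reference, jobs, top_k=3):
    """Return up to top_k (job_id, similarity) pairs, most similar first."""
    scored = [(job_id, similarity(reference, series))
              for job_id, series in jobs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical per-segment I/O activity of three jobs.
jobs = {
    "job_a": [0, 0, 4, 4, 0],
    "job_b": [0, 0, 4, 3, 0],
    "job_c": [8, 8, 8, 8, 8, 8],
}
reference = [0, 0, 4, 4, 0]
ranking = rank_similar_jobs(reference, jobs)
```

Here `ranking` lists `job_a` first, since it is identical to the reference, followed by `job_b` and `job_c`. A production workflow would likely replace the distance function with a measure suited to the monitored metrics.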

Notes

  1. We found in preliminary experiments that 10 min reduces compute time and noise, i.e., the variation of the statistics when re-running the same job.

  2. This can be support staff or a data center user who executed the job.

  3. The reason is that a few write calls transfer many bytes; because the number of write calls is less than our 90%-quantile, the write calls metric will be set to 0.
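
The thresholding mentioned in note 3 can be sketched as follows. Only the 90% quantile as a cut-off comes from the note; the nearest-rank quantile definition, the function names, and the sample values are assumptions made for this example and are not taken from the paper.

```python
def quantile(values, percent):
    """Nearest-rank quantile: smallest value with at least percent% of the data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, clamped to at least rank 1.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def zero_below_quantile(values, percent=90):
    """Keep values at or above the quantile threshold; set all others to 0."""
    threshold = quantile(values, percent)
    return [v if v >= threshold else 0 for v in values]


# Hypothetical number of write calls observed for ten jobs.
write_calls = [3, 5, 2, 4, 6, 1, 7, 8, 9, 500]
filtered = zero_below_quantile(write_calls)
```

With these sample values the threshold is 9, so only the two largest counts survive and every other job's write calls are reported as 0, even if those few calls transferred many bytes.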

Author information

Corresponding author

Correspondence to Julian Kunkel.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kunkel, J., Betke, E. (2021). Toward a Workflow for Identifying Jobs with Similar I/O Behavior Utilizing Time Series Analysis. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_10

  • DOI: https://doi.org/10.1007/978-3-030-90539-2_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-90538-5

  • Online ISBN: 978-3-030-90539-2

  • eBook Packages: Computer Science, Computer Science (R0)
