Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems

Madireddy, Sandeep; Balaprakash, Prasanna; Carns, Philip; Latham, Robert; Ross, Robert; Snyder, Shane; Wild, Stefan M.

doi:10.1007/978-3-319-92040-5_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10876)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Parallel I/O hardware and software infrastructure is a key contributor to performance variability for applications running on large-scale HPC systems. This variability confounds efforts to predict application performance for characterization, modeling, optimization, and job scheduling. We propose a modeling approach that improves predictive ability by explicitly treating the variability and by leveraging the sensitivity of application parameters on performance to group applications with similar characteristics. We develop a Gaussian process-based machine learning algorithm to model I/O performance and its variability as a function of application and file system characteristics. We demonstrate the effectiveness of the proposed approach using data collected from the Edison system at the National Energy Research Scientific Computing Center. The results show that the proposed sensitivity-based models are better at prediction when compared with application-partitioned or unpartitioned models. We highlight modeling techniques that are robust to the outliers that can occur in production parallel file systems. Using the developed metrics and modeling approach, we provide insights into the file system metrics that have a significant impact on I/O performance.
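As an illustration of the kind of model the abstract describes, the following is a minimal sketch of Gaussian process regression that returns both a predicted value and a predictive variance. It is not the authors' implementation: the squared-exponential kernel, the fixed hyperparameters, the single input feature (a hypothetical log10 of bytes written), and the toy throughput values are all assumptions made for the example.

```python
import math


def rbf_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular factor L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y


def solve_lower_transpose(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(x_train, y_train, x_test, noise_var=0.05,
               length_scale=1.0, signal_var=1.0):
    """Return a list of (mean, variance) pairs, one per test input.

    The variance includes noise_var, so it describes the spread of a new
    noisy observation rather than of the latent function alone.
    """
    n = len(x_train)
    y_mean = sum(y_train) / n
    centered = [y - y_mean for y in y_train]  # constant-mean prior

    # Covariance of the training outputs: kernel matrix plus noise on the diagonal.
    K = [[rbf_kernel(x_train[i], x_train[j], length_scale, signal_var)
          + (noise_var if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_lower_transpose(L, solve_lower(L, centered))

    predictions = []
    for xs in x_test:
        k_star = [rbf_kernel(xs, xi, length_scale, signal_var) for xi in x_train]
        mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
        v = solve_lower(L, k_star)
        var = signal_var + noise_var - sum(vi * vi for vi in v)
        predictions.append((mean, var))
    return predictions


if __name__ == "__main__":
    # Hypothetical data: log10(bytes written) -> log10(throughput).
    x_train = [6.0, 7.0, 8.0, 9.0]
    y_train = [1.2, 1.9, 2.4, 2.6]
    for x, (mean, var) in zip([7.5, 12.0], gp_predict(x_train, y_train, [7.5, 12.0])):
        print(f"x={x}: mean={mean:.3f}, std={math.sqrt(var):.3f}")
```

In this sketch the predictive variance grows as a test input moves away from the training inputs, which is the property that lets a Gaussian process report uncertainty alongside its prediction. A practical model would take many application and file system features as inputs and fit the kernel hyperparameters from data rather than fixing them by hand.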

Notes

1.
https://github.com/jhammond/lltop.

Acknowledgment

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Authors and Affiliations

Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue, Lemont, IL, 60439, USA
Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert Latham, Robert Ross, Shane Snyder & Stefan M. Wild

Corresponding author

Correspondence to Sandeep Madireddy.

Editor information

Editors and Affiliations

Tokyo Institute of Technology, Tokyo, Japan
Rio Yokota
University of Edinburgh, Edinburgh, United Kingdom
Michèle Weiland
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
David Keyes
Technische Universität München, Garching bei München, Germany
Carsten Trinitis

About this paper

Cite this paper

Madireddy, S. et al. (2018). Machine Learning Based Parallel I/O Predictive Modeling: A Case Study on Lustre File Systems. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol 10876. Springer, Cham. https://doi.org/10.1007/978-3-319-92040-5_10

DOI: https://doi.org/10.1007/978-3-319-92040-5_10
Published: 29 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92039-9
Online ISBN: 978-3-319-92040-5
eBook Packages: Computer Science, Computer Science (R0)
