Memory Usage Prediction of HPC Workloads Using Feature Engineering and Machine Learning

Published: 27 February 2023

Abstract

In High Performance Computing (HPC) systems, numerous applications of varying scale and domain are scheduled to run concurrently and share the available CPU and memory capacity. Applications whose run-time memory usage is not known a priori are commonly allocated significantly more memory than they actually need, which leads to poor resource utilization and degrades the performance of the overall system. In this paper, we share our experience of performing user analysis and prediction over a large-scale resource utilization dataset to tightly estimate the memory requirements of a wide variety of applications on the Titan supercomputer. By coupling our engineered features with random forest and XGBoost supervised machine learning techniques, our models predict the correct class of memory usage for 89% and 90% of the validation data, respectively. Furthermore, for more than 98% of users, the average prediction accuracy is 95% or better when a tolerance of one class around the actual memory usage is allowed.
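The two accuracy figures quoted above (exact-class accuracy and accuracy within a one-class tolerance) can be illustrated with a minimal sketch. It assumes memory usage is binned into ordinal classes; the bin edges in `EDGES_GB`, the helper names, and the toy data are hypothetical and are not taken from the paper, whose actual class definitions are not given in the abstract.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical class boundaries in GB (an assumption for illustration only).
EDGES_GB = [1, 2, 4, 8, 16]


def memory_class(mem_gb, edges=EDGES_GB):
    """Map a memory usage value to an ordinal class index in 0..len(edges)."""
    return bisect_right(edges, mem_gb)


def accuracy(y_true, y_pred, tolerance=0):
    """Fraction of predictions within `tolerance` classes of the true class."""
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def per_user_accuracy(users, y_true, y_pred, tolerance=1):
    """Average tolerance-based accuracy computed separately for each user."""
    grouped = defaultdict(lambda: ([], []))
    for user, t, p in zip(users, y_true, y_pred):
        grouped[user][0].append(t)
        grouped[user][1].append(p)
    return {u: accuracy(ts, ps, tolerance) for u, (ts, ps) in grouped.items()}


# Toy data: observed memory usage per job and the class a model predicted.
users = ["a", "a", "b", "b"]
actual_gb = [0.5, 3.0, 7.5, 20.0]
y_true = [memory_class(m) for m in actual_gb]  # [0, 2, 3, 5]
y_pred = [0, 3, 3, 2]

exact = accuracy(y_true, y_pred)                   # 0.5
within_one = accuracy(y_true, y_pred, tolerance=1) # 0.75
by_user = per_user_accuracy(users, y_true, y_pred) # {'a': 1.0, 'b': 0.5}
```

In this sketch, a prediction that lands in a neighbouring class counts as correct under the one-class tolerance, which is why the tolerance-based score is never lower than the exact-class score.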

Published In

HPCAsia '23: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
February 2023
161 pages
ISBN: 9781450398053
DOI: 10.1145/3578178

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. High Performance Computing
2. Memory Allocation Prediction
3. Random Forest
4. Supervised Learning
5. Workload Dataset
6. XGBoost

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

HPC ASIA 2023

Acceptance Rates

HPCAsia '23 Paper Acceptance Rate: 15 of 34 submissions, 44%
Overall Acceptance Rate: 69 of 143 submissions, 48%