DOI: 10.1145/3583678.3596893

Evaluating HPC Job Run Time Predictions Using Application Input Parameters

Published: 27 June 2023

Abstract

Accurately predicting application run times in high performance computing (HPC) is difficult, yet such predictions are valuable for job scheduling and user feedback. User-supplied predictions can be inaccurate for a variety of reasons, including inexperience, user burden, and an incentive to overpredict. Most automated approaches consider standardized job inputs from submission scripts but ignore application input parameters. Application input parameters can greatly improve run time prediction accuracy but have typically been avoided because they must be collected manually for each application.
In this paper, we evaluate and compare the trade-offs between conventional, job script-based predictors and specialized, application input-based predictors. We do so by testing 20 machine learning model variants and four traditional predictors against a suite of five applications: four commonly used, representative proxy applications and one real-world application. For reproducibility and extensibility, we provide the source code of our testing framework and our data set, which, to the best of our knowledge, is the first public data set to include application input parameters alongside standard job parameters. We determine that the random forest regressor offers the best trade-off between accuracy and training time among all tested model variants. We show that job parameters alone are insufficient to produce adequate predictions, whereas application input parameters provide excellent results, with R² as high as 99%, and typically outperform the use of job parameters alone.
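
The contrast between job parameters and application input parameters can be illustrated with a small synthetic sketch. The Python example below is not the paper's testing framework or data set: the job generator, the `size` input, the run time formula, and the lookup-table mean predictor are all assumptions made for illustration, and the predictor is a deliberately simple stand-in for the evaluated models such as the random forest regressor. It compares R² on held-out jobs when predicting from the node count alone versus the node count plus one application input.

```python
import random
from collections import defaultdict
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def make_jobs(n, rng):
    """Generate synthetic jobs (hypothetical model, not measured data).

    Run time grows with the problem size and shrinks with the node count,
    with +/-5% multiplicative noise.
    """
    jobs = []
    for _ in range(n):
        nodes = rng.choice([1, 2, 4, 8])        # job-script parameter
        size = rng.choice([64, 128, 256, 512])  # hypothetical application input
        run_time = size ** 2 / (100.0 * nodes) * rng.uniform(0.95, 1.05)
        jobs.append({"nodes": nodes, "size": size, "run_time": run_time})
    return jobs


def fit_mean_table(jobs, keys):
    """Return a predictor giving the mean training run time per key combination."""
    groups = defaultdict(list)
    for job in jobs:
        groups[tuple(job[k] for k in keys)].append(job["run_time"])
    table = {combo: mean(times) for combo, times in groups.items()}
    fallback = mean(job["run_time"] for job in jobs)  # used for unseen combinations
    return lambda job: table.get(tuple(job[k] for k in keys), fallback)


rng = random.Random(0)
train, test = make_jobs(400, rng), make_jobs(200, rng)
y_test = [job["run_time"] for job in test]

# Predictor that sees only the job parameter.
job_only = fit_mean_table(train, ["nodes"])
# Predictor that also sees the application input parameter.
with_inputs = fit_mean_table(train, ["nodes", "size"])

r2_job_only = r2_score(y_test, [job_only(job) for job in test])
r2_with_inputs = r2_score(y_test, [with_inputs(job) for job in test])

print(f"job parameters only:       R2 = {r2_job_only:.3f}")
print(f"with application inputs:   R2 = {r2_with_inputs:.3f}")
```

In this toy setting the run time is dominated by the problem size, so a predictor that cannot see it explains only a small share of the variance. How large the gap is for a real application depends on how strongly its inputs drive the run time.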

Published In

DEBS '23: Proceedings of the 17th ACM International Conference on Distributed and Event-based Systems
June 2023
221 pages
ISBN: 9798400701221
DOI: 10.1145/3583678
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. run time prediction
  2. machine learning
  3. high performance computing

Qualifiers

  • Research-article

Conference

DEBS '23

Acceptance Rates

Overall Acceptance Rate 145 of 583 submissions, 25%

Cited By

  • (2024) Workload-Adaptive Scheduling for Efficient Use of Parallel File Systems in High-Performance Computing Clusters. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1506-1516. DOI: 10.1109/SCW63240.2024.00190. Online publication date: 17-Nov-2024
  • (2024) HPCAdvisor: A Tool for Assisting Users in Selecting HPC Resources in the Cloud. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pages 629-637. DOI: 10.1109/SCW63240.2024.00087. Online publication date: 17-Nov-2024
  • (2024) Job Runtime Prediction: A Two-Stage Framework Beyond PQR2 with Fallback and Enhanced Classification. 2024 IEEE 36th International Conference on Tools with Artificial Intelligence (ICTAI), pages 116-121. DOI: 10.1109/ICTAI62512.2024.00025. Online publication date: 28-Oct-2024
  • (2024) JEM: An AI-based engine workflow to predict simulation’s execution time on HPC cluster. 2024 International Conference on Control, Automation and Diagnosis (ICCAD), pages 1-5. DOI: 10.1109/ICCAD60883.2024.10553971. Online publication date: 15-May-2024
