DOI: 10.1145/3491418.3535172
research-article

Towards Practical, Generalizable Machine-Learning Training Pipelines to build Regression Models for Predicting Application Resource Needs on HPC Systems

Published: 08 July 2022

Abstract

This paper explores the potential for cost-effectively developing generalizable, scalable machine-learning-based regression models that predict the approximate execution time of an HPC application from its input data and parameters. This work examines: (a) to what extent models can be trained on scaled-down datasets in commodity environments and then adapted to production environments, (b) to what extent models built for specific applications generalize to other applications within the same family, and (c) how the most appropriate model may change with the type and mix of data. As part of this work, we also describe and demonstrate an automatable pipeline for generating the necessary training data and building the model.
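
Question (a), whether a model trained on scaled-down runs can be carried over to larger inputs, can be illustrated with a deliberately simple regression. The sketch below is not the pipeline or model used in the paper. It assumes, hypothetically, that runtime follows a power law t ≈ a * n^b in a single input-size feature n, fits that law by ordinary least squares in log-log space on small runs, and extrapolates to a larger input. The function names `fit_power_law` and `predict_runtime` and the synthetic timings are illustrative only.

```python
import math


def fit_power_law(sizes, times):
    """Fit t ~= a * n**b via ordinary least squares on (log n, log t).

    Hypothetical single-feature model; practical runtime predictors
    usually combine many input features and application parameters.
    """
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs of equal length")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    # The slope of the log-log regression line is the scaling exponent b.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    # The intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_runtime(a, b, n):
    """Extrapolate the fitted power law to input size n."""
    return a * n ** b


if __name__ == "__main__":
    # Synthetic "scaled-down" runs (illustrative, not measured data):
    # runtimes in seconds generated exactly as 0.002 * n**1.5.
    small_sizes = [64, 128, 256, 512]
    small_times = [0.002 * n ** 1.5 for n in small_sizes]

    a, b = fit_power_law(small_sizes, small_times)
    print(f"a={a:.4f}, b={b:.3f}")
    print(f"predicted runtime at n=4096: {predict_runtime(a, b, 4096):.1f} s")
```

With noise-free synthetic data the fit recovers the exponent 1.5 up to floating-point rounding. On real measurements, extrapolating from small runs can miss effects that only appear at production scale, and, in the spirit of question (c), a log-log linear model is just one candidate whose suitability may change with the type and mix of data.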

Information

Published In

PEARC '22: Practice and Experience in Advanced Research Computing 2022: Revolutionary: Computing, Connections, You
July 2022
455 pages
ISBN: 9781450391610
DOI: 10.1145/3491418

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ML
  2. automated data generation
  3. execution time
  4. model scalability
  5. model transferability

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PEARC '22

Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 21
  • Downloads (last 6 weeks): 1
Reflects downloads up to 05 Mar 2025.

Cited By

  • (2024) Reference Implementation of Smart Scheduler: A CI-Aware, AI-Driven Scheduling Framework for HPC Workloads. Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 1-4. DOI: 10.1145/3626203.3670555. Online publication date: 17-Jul-2024.
  • (2023) Towards Characterizing DNNs to Estimate Training Time using HARP (HPC Application Resource (runtime) Predictor. Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good, 483-485. DOI: 10.1145/3569951.3597607. Online publication date: 23-Jul-2023.
  • (2023) Insights from the HARP Framework: Using an AI-Driven Approach for Efficient Resource Allocation in HPC Scientific Workflows. Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good, 341-344. DOI: 10.1145/3569951.3597595. Online publication date: 23-Jul-2023.
  • (2022) Establishing a Generalizable Framework for Generating Cost-Aware Training Data and Building Unique Context-Aware Walltime Prediction Regression Models. 2022 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 497-506. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00070. Online publication date: Dec-2022.