Performance and power modeling and prediction using MuMMI and 10 machine learning methods

Wu, Xingfu; Taylor, Valerie; Lan, Zhiling

doi:10.1002/cpe.7254


Journal Article · August 5, 2022 · Concurrency and Computation. Practice and Experience

DOI: https://doi.org/10.1002/cpe.7254 · OSTI ID: 1986033

Wu, Xingfu ^[1]; Taylor, Valerie ^[1]; Lan, Zhiling ^[2]

[1] Argonne National Laboratory (ANL), Argonne, IL (United States). Mathematics & Computer Science Division; Univ. of Chicago, IL (United States)
[2] Illinois Institute of Technology, Chicago, IL (United States)

Summary

Energy‐efficient scientific applications require insight into how high performance computing system features impact the applications' power and performance. This insight can result from the development of performance and power models. In this article, we use the modeling and prediction tool MuMMI (Multiple Metrics Modeling Infrastructure) and 10 machine learning methods to model and predict performance and power consumption and compare their prediction error rates. We use an algorithm‐based fault‐tolerant linear algebra code and a multilevel checkpointing fault‐tolerant heat distribution code to conduct our modeling and prediction study on the Cray XC40 Theta and IBM BG/Q Mira at Argonne National Laboratory and the Intel Haswell cluster Shepard at Sandia National Laboratories. Our experimental results show that the prediction error rates in performance and power using MuMMI are less than 10% for most cases. By utilizing the models for runtime, node power, CPU power, and memory power, we identify the most significant performance counters for potential application optimizations, and we predict theoretical outcomes of the optimizations. Based on two collected datasets, we analyze and compare the prediction accuracy in performance and power consumption using MuMMI and 10 machine learning methods.

Research Organization: Argonne National Laboratory (ANL), Argonne, IL (United States)

Sponsoring Organization: USDOE Laboratory Directed Research and Development (LDRD) Program; National Science Foundation (NSF)

Grant/Contract Number: AC02-06CH11357; CCF-1801856; CCF-2119203; DE‐AC02‐06CH11357; RAPIDS2

OSTI ID: 1986033

Alternate ID(s): OSTI ID: 1983367

Journal Information: Concurrency and Computation. Practice and Experience, Vol. 35, Issue 15; ISSN 1532-0626

Publisher: Wiley

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
MuMMI
fault tolerant applications
machine learning
modeling
performance
power
prediction
