Title: Performance and power modeling and prediction using MuMMI and 10 machine learning methods

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.7254 · OSTI ID: 1986033
[1]; [1]; [2]
  1. Argonne National Laboratory (ANL), Argonne, IL (United States). Mathematics & Computer Science Division; Univ. of Chicago, IL (United States)
  2. Illinois Institute of Technology, Chicago, IL (United States)

Summary: Energy‐efficient scientific applications require insight into how high performance computing system features impact the applications' power and performance. This insight can result from the development of performance and power models. In this article, we use the modeling and prediction tool MuMMI (Multiple Metrics Modeling Infrastructure) and 10 machine learning methods to model and predict performance and power consumption and compare their prediction error rates. We use an algorithm‐based fault‐tolerant linear algebra code and a multilevel checkpointing fault‐tolerant heat distribution code to conduct our modeling and prediction study on the Cray XC40 Theta and IBM BG/Q Mira at Argonne National Laboratory and the Intel Haswell cluster Shepard at Sandia National Laboratories. Our experimental results show that the prediction error rates in performance and power using MuMMI are less than 10% for most cases. By utilizing the models for runtime, node power, CPU power, and memory power, we identify the most significant performance counters for potential application optimizations, and we predict theoretical outcomes of the optimizations. Based on two collected datasets, we analyze and compare the prediction accuracy in performance and power consumption using MuMMI and 10 machine learning methods.

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Laboratory Directed Research and Development (LDRD) Program; National Science Foundation (NSF)
Grant/Contract Number:
DE‐AC02‐06CH11357; CCF-1801856; CCF-2119203; RAPIDS2
Alternate ID(s):
OSTI ID: 1983367
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 35, Issue 15; ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
