Experiences in autotuning matrix multiplication for energy minimization on GPUs

Anzt, Hartwig; Haugen, Blake; Kurzak, Jakub; Luszczek, Piotr; Dongarra, Jack

doi:10.1002/cpe.3516

Journal Article · May 20, 2015 · Concurrency and Computation: Practice and Experience

DOI: https://doi.org/10.1002/cpe.3516 · OSTI ID: 1361296

Anzt, Hartwig ^[1]; Haugen, Blake ^[1]; Kurzak, Jakub ^[1]; Luszczek, Piotr ^[1]; Dongarra, Jack ^[2]

^[1] Univ. of Tennessee, Knoxville, TN (United States). Dept. of Electrical Engineering and Computer Science
^[2] Univ. of Tennessee, Knoxville, TN (United States). Dept. of Electrical Engineering and Computer Science; Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)

Summary: In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing unit (GPU) kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance-optimal kernel is not the most efficient one in overall resource use. Copyright © 2015 John Wiley & Sons, Ltd.
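
The distinction the summary draws between the fastest kernel and the most energy-efficient kernel can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the variant names and the performance and power figures are hypothetical placeholders, not measurements from the paper, and energy is approximated as average power multiplied by runtime. The only standard fact used is the 2n³ floating-point operation count for an n × n matrix multiplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelRun:
    """One hypothetical autotuning candidate with made-up measurements."""
    name: str
    gflops: float       # sustained double-precision rate in GFLOP/s (hypothetical)
    avg_power_w: float  # average power draw in watts during the run (hypothetical)


def runtime_s(run: KernelRun, n: int) -> float:
    """Runtime of an n x n DGEMM, using the standard 2*n^3 flop count."""
    flops = 2.0 * n ** 3
    return flops / (run.gflops * 1e9)


def energy_j(run: KernelRun, n: int) -> float:
    """Energy to solution in joules, approximated as average power x runtime."""
    return run.avg_power_w * runtime_s(run, n)


# Hypothetical candidates: two variants with equal performance but different
# power draw, and a slightly slower variant that draws less power.
runs = [
    KernelRun("variant_a", gflops=1000.0, avg_power_w=200.0),
    KernelRun("variant_b", gflops=1000.0, avg_power_w=180.0),
    KernelRun("variant_c", gflops=950.0, avg_power_w=160.0),
]

N = 4096  # matrix dimension chosen only for the example

# Tuning for runtime only picks the highest GFLOP/s (first one on a tie).
fastest = max(runs, key=lambda r: r.gflops)

# Tuning for energy picks the lowest energy to solution.
most_efficient = min(runs, key=lambda r: energy_j(r, N))

for r in runs:
    print(f"{r.name}: {runtime_s(r, N):.4f} s, {energy_j(r, N):.2f} J")
print("fastest:", fastest.name)
print("lowest energy:", most_efficient.name)
```

With these placeholder numbers, the two equally fast variants differ in energy, and the slower third variant uses the least energy overall. Changing the inputs changes which candidate wins, which is why an autotuner needs a stated objective (runtime, energy, or a combination) before it ranks candidates.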

Research Organization: Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE; National Science Foundation (NSF); Nvidia Corporation (United States); Intel Corporation (United States); Advanced Micro Devices, Inc. (AMD) (United States); Russian Scientific Fund (Russian Federation)

Contributing Organization: Univ. of Manchester (United Kingdom)

Grant/Contract Number: AC05-00OR22725; SHF-1320603; N14-11-00190

OSTI ID: 1361296

Alternate ID(s): OSTI ID: 1401625

Journal Information: Concurrency and Computation: Practice and Experience, Vol. 27, Issue 17; ISSN 1532-0626

Publisher: Wiley

Country of Publication: United States

Language: English

Citation Metrics: Cited by 10 works (citation information provided by Web of Science)

Related Subjects

97 MATHEMATICS AND COMPUTING
automatic software tuning
hardware accelerators
matrix multiplication
power
energy
