Batched matrix computations on hardware accelerators based on GPUs

Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; Tomov, Stanimire; Dongarra, Jack

doi:10.1177/1094342014567546

Journal Article · February 9, 2015 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/1094342014567546 · OSTI ID: 1361289

Haidar, Azzam [1]; Dong, Tingxing [1]; Luszczek, Piotr [1]; Tomov, Stanimire [1]; Dongarra, Jack [2]

[1] Univ. of Tennessee, Knoxville, TN (United States)
[2] Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)

Scientific applications require solvers that work on many small problems that are independent of each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, so there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems, which we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This study therefore describes the development of the most common one-sided factorizations (Cholesky, LU, and QR) for a set of small dense matrices. The algorithms we present, together with their implementations, are by design inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms, which work under drastically different assumptions about hardware design and the efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest in the application use cases. The paradigm whereby a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications' context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to a two-fold speedup and three-fold better energy efficiency compared with our highly optimized batched CPU implementations based on the MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as a 2.5× speedup on the NVIDIA K40 GPU.
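
As a rough illustration of what "a sequence of batched BLAS routines" can look like, the following pure-Python sketch applies a blocked, right-looking Cholesky factorization to a batch of small symmetric positive definite matrices. It is not the authors' GPU implementation. The function names (`batched_potf2`, `batched_trsm`, `batched_syrk`, `batched_cholesky`), the block size `nb`, and the list-of-lists storage are assumptions made for illustration only, and each "batched" routine simply loops over the batch on the CPU where a GPU kernel would process all matrices in parallel.

```python
import math


def batched_potf2(batch, k, nb):
    """Unblocked Cholesky of the nb x nb diagonal block at (k, k), for every matrix."""
    for A in batch:  # on a GPU, this loop would be the parallel batch dimension
        for j in range(k, k + nb):
            A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(k, j)))
            for i in range(j + 1, k + nb):
                s = sum(A[i][p] * A[j][p] for p in range(k, j))
                A[i][j] = (A[i][j] - s) / A[j][j]


def batched_trsm(batch, k, nb, n):
    """Panel solve X * L_kk^T = A_panel for the rows below the diagonal block."""
    for A in batch:
        for i in range(k + nb, n):
            for j in range(k, k + nb):
                s = sum(A[i][p] * A[j][p] for p in range(k, j))
                A[i][j] = (A[i][j] - s) / A[j][j]


def batched_syrk(batch, k, nb, n):
    """Trailing update A22 -= L21 * L21^T (lower triangle only)."""
    for A in batch:
        for i in range(k + nb, n):
            for j in range(k + nb, i + 1):
                A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k, k + nb))


def batched_cholesky(batch, nb=2):
    """Return the lower Cholesky factors of all n x n matrices in the batch.

    Every step is one batched call applied to the whole batch in lockstep,
    so all matrices advance through the factorization together.
    """
    batch = [[row[:] for row in A] for A in batch]  # work on copies
    n = len(batch[0])
    for k in range(0, n, nb):
        b = min(nb, n - k)
        batched_potf2(batch, k, b)    # factor diagonal blocks
        batched_trsm(batch, k, b, n)  # solve for the panels below them
        batched_syrk(batch, k, b, n)  # update trailing submatrices
    for A in batch:                   # clear the unused upper triangle
        for i in range(n):
            for j in range(i + 1, n):
                A[i][j] = 0.0
    return batch


if __name__ == "__main__":
    mats = [
        [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]],
        [[9.0, 3.0, 0.0], [3.0, 5.0, 2.0], [0.0, 2.0, 17.0]],
    ]
    for L in batched_cholesky(mats):
        print(L)
```

The point of the structure is that the driver loop contains no per-matrix control flow: each iteration issues three batched operations, which is the property that lets such a factorization run entirely on a throughput-oriented device.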

Research Organization: Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE; National Science Foundation (NSF); Nvidia Corporation (United States); Russian Scientific Fund (Russian Federation)

Contributing Organization: Univ. of Manchester (United Kingdom)

Grant/Contract Number: AC05-00OR22725; ACI-1339822; N14-11-00190

OSTI ID: 1361289

Journal Information: International Journal of High Performance Computing Applications, Vol. 29, Issue 2; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Citation Metrics:

Cited by: 32 works

Citation information provided by Web of Science

Related Subjects

97 MATHEMATICS AND COMPUTING
batched factorization
numerical linear algebra
hardware accelerators
numerical software libraries
one-sided factorization algorithms
