Abstract
As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach represents the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to hybrid CPU-GPU algorithms, which rely heavily on the multicore CPU for specific parts of the workload. For a system to benefit fully from the GPU's significantly higher energy efficiency, however, avoiding the multicore CPU must be a primary design goal, so that the system can rely more heavily on the more efficient GPU; doing so also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single streaming multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5× speedup on the K40 GPU.
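The batched mode of operation the abstract describes, i.e. applying the same Householder QR steps to many small, independent matrices, can be sketched in plain NumPy. This is only an illustrative CPU reference, not the paper's GPU implementation; the function name `batched_qr` and its interface are our own. The inner batch loop is the dimension that a GPU kernel would execute in parallel, one small problem per thread block.

```python
import numpy as np

def batched_qr(batch):
    """Householder QR applied to every matrix in a batch of small problems.

    batch : array of shape (batch_size, m, n) with m >= n.
    Returns Q of shape (batch_size, m, m) and R of shape (batch_size, m, n)
    such that batch[i] == Q[i] @ R[i] and each R[i] is upper triangular.
    """
    bs, m, n = batch.shape
    Q = np.broadcast_to(np.eye(m), (bs, m, m)).copy()
    R = np.array(batch, dtype=float)
    for k in range(n):           # the same column step for every problem ...
        for i in range(bs):      # ... while the batch dimension is embarrassingly parallel
            x = R[i, k:, k]
            alpha = np.linalg.norm(x)
            if alpha == 0.0:
                continue
            v = x.copy()
            # choose the sign that avoids cancellation in v[0]
            v[0] += alpha if x[0] >= 0 else -alpha
            v /= np.linalg.norm(v)
            # apply the reflector H = I - 2 v v^T to the trailing submatrix and to Q
            R[i, k:, k:] -= 2.0 * np.outer(v, v @ R[i, k:, k:])
            Q[i, :, k:] -= 2.0 * np.outer(Q[i, :, k:] @ v, v)
    return Q, R
```

For reference, cuBLAS exposes an analogous batched routine, `cublas<t>geqrfBatched`, which is the implementation the abstract's 5× comparison is made against.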
Notes
1. Historically, similar issues were associated with strong scaling [14] and were attributed to a fixed problem size.
References
Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., Tomov, S.: Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J. Phys.: Conf. Ser. 180(1), 012037 (2009)
Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S., Tomov, S.: Faster, cheaper, better - a hybridization methodology to develop linear algebra software for GPUs. In: Hwu, W.W. (ed.) GPU Computing Gems. Morgan Kaufmann, California (2010)
Agullo, E., Dongarra, J., Nath, R., Tomov, S.: Fully empirical autotuned QR factorization for multicore architectures. CoRR, abs/1102.5328 (2011)
ACML - AMD Core Math Library (2014). http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml
Anderson, M.J., Sheffield, D., Keutzer, K.: A predictive model for solving small linear algebra problems in GPU registers. In: IEEE 26th International Parallel Distributed Processing Symposium (IPDPS) (2012)
Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P., Tomov, S.: The impact of multicore on math software. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 1–10. Springer, Heidelberg (2007)
Cao, C., Dongarra, J., Du, P., Gates, M., Luszczek, P., Tomov, S.: clMAGMA: high performance dense linear algebra with OpenCL. In: The ACM International Conference Series, Atlanta, May 13–14 (2013). (submitted)
Dong, T., Haidar, A., Luszczek, P., Harris, A., Tomov, S., Dongarra, J.: LU factorization of small matrices: accelerating batched DGETRF on the GPU. In: Proceedings of 16th IEEE International Conference on High Performance and Communications (HPCC 2014), August 2014
Dong, T., Haidar, A., Tomov, S., Dongarra, J.: A fast batched Cholesky factorization on a GPU. In: Proceedings of 2014 International Conference on Parallel Processing (ICPP-2014), September 2014
Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S., Dongarra, J.: A step towards energy efficient computing: redesigning a hydrodynamic application on CPU-GPU. In: IEEE 28th International Parallel Distributed Processing Symposium (IPDPS) (2014)
Dongarra, J., Haidar, A., Kurzak, J., Luszczek, P., Tomov, S., YarKhan, A.: Model-driven one-sided factorizations on multicore accelerated systems. Supercomputing Frontiers and Innovations 1(1), 85 (2014)
Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J.: From CUDA to OpenCL: towards a performance-portable solution for multi-platform GPU programming. Parallel Comput. 38(8), 391–407 (2012)
Oak Ridge Leadership Computing Facility. Annual report 2013–2014 (2014). https://www.olcf.ornl.gov/wp-content/uploads/2015/01/AR_2014_Small.pdf
Gustafson, J.L.: Reevaluating Amdahl’s law. Commun. ACM 31(5), 532–533 (1988)
Haidar, A., Tomov, S., Dongarra, J., Solca, R., Schulthess, T.: A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks. Int. J. High Perform. Comput. Appl. 28(2), 196–209 (2012)
Haidar, A., Cao, C., Yarkhan, A., Luszczek, P., Tomov, S., Kabir, K., Dongarra, J.: Unified development for mixed multi-GPU and multi-coprocessor environments using a lightweight runtime environment. In: IPDPS 2014 Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 491–500. IEEE Computer Society, Washington (2014)
Haidar, A., Dong, T., Luszczek, P., Tomov, S., Dongarra, J.: Batched matrix computations on hardware accelerators based on GPUs. Int. J. High Perform. Comput. Appl. 18(1), 135–158 (2015). doi:10.1177/1094342014567546
Haidar, A., Luszczek, P., Tomov, S., Dongarra, J.: Optimization for performance and energy for batched matrix computations on GPUs. In: PPoPP 2015 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015, ACM, San Francisco, February 2015
Haidar, A., Luszczek, P., Tomov, S., Dongarra, J.: Towards batched linear solvers on accelerated hardware platforms. In: PPoPP 2015 Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, San Francisco, February 2015
Im, E.-J., Yelick, K., Vuduc, R.: Sparsity: optimization framework for sparse matrix kernels. Int. J. High Perform. Comput. Appl. 18(1), 135–158 (2004)
Matrix algebra on GPU and multicore architectures (MAGMA), MAGMA Release 1.6.1 (2015). http://icl.cs.utk.edu/magma/
Intel Pentium III Processor - Small Matrix Library (1999). http://www.intel.com/design/pentiumiii/sml/
Intel Math Kernel Library (2014). http://software.intel.com/intel-mkl/
Intel 64 and IA-32 architectures software developer’s manual, July 20 (2014). http://download.intel.com/products/processor/manual/
Keyes, D., Taylor, V.: NSF-ACCI task force on software for science and engineering, December 2010
Liao, J.C., Khodayari, A., Zomorrodi, A.R., Maranas, C.D.: A kinetic model of Escherichia coli core metabolism satisfying multiple sets of mutant flux data. Metab. Eng. 25C, 50–62 (2014)
Li, Y., Dongarra, J., Tomov, S.: A note on auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009, Part I. LNCS, vol. 5544, pp. 884–892. Springer, Heidelberg (2009)
Messer, O.E.B., Harris, J.A., Parete-Koon, S., Chertkow, M.A.: Multicore and accelerator development for a leadership-class stellar astrophysics code. In: Manninen, P., Öster, P. (eds.) PARA. LNCS, vol. 7782, pp. 92–106. Springer, Heidelberg (2013)
Molero, J.M., Garzón, E.M., García, I., Quintana-Ortí, E.S., Plaza, A.: Poster: a batched Cholesky solver for local RX anomaly detection on GPUs. In: PUMPS (2013)
Nath, R., Tomov, S., Dong, T., Dongarra, J.: Optimizing symmetric dense matrix-vector multiplication on GPUs. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2011
Nath, R., Tomov, S., Dongarra, J.: Accelerating GPU kernels for dense linear algebra. In: VECPAR 2010 Proceedings of the 2009 International Meeting on High Performance Computing for Computational Science, pp. 22–25. Springer, Berkeley, June 2010
Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for Fermi graphics processing units. Int. J. High Perform. Comput. Appl. 24(4), 511–515 (2010)
NVIDIA Visual Profiler (2014)
NVIDIA Management Library (NVML) (2014). https://developer.nvidia.com/nvidia-management-library-nvml
CUBLAS (2014). http://docs.nvidia.com/cuda/cublas/
CUBLAS 6.5, January 2015. http://docs.nvidia.com/cuda/cublas/
Villa, O., Fatica, M., Gawande, N., Tumeo, A.: Power/performance trade-offs of small batched LU based solvers on GPUs. In: Wolf, F., Mohr, B., an Mey, D. (eds.) Euro-Par 2013. LNCS, vol. 8097, pp. 813–825. Springer, Heidelberg (2013)
Villa, O., Gawande, N., Tumeo, A.: Accelerating subsurface transport simulation on heterogeneous clusters. In: IEEE International Conference on Cluster Computing (CLUSTER 2013), pp. 23–27, Indiana, September 2013
Rotem, E., Naveh, A., Rajwan, D., Ananthakrishnan, A., Weissmann, E.: Power-management architecture of the Intel microarchitecture code-named Sandy Bridge. IEEE Micro 32(2), 20–27 (2012). doi:10.1109/MM.2012.12. ISSN: 0272-1732
Tomov, S., Dongarra, J., Baboulin, M.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Comput. Syst. Appl. 36(5–6), 232–240 (2010). doi:10.1016/j.parco.2009.12.005
Tomov, S., Nath, R., Ltaief, H., Dongarra, J.: Dense linear algebra solvers for multicore with GPU accelerators. In: Proceedings of the IEEE IPDPS 2010, pp. 1–8. IEEE Computer Society, Atlanta, 19–23 April 2010. doi:10.1109/IPDPSW.2010.5470941
Tomov, S., Dongarra, J.: Dense linear algebra for hybrid GPU-based systems. In: Kurzak, J., Bader, D.A., Dongarra, J. (eds.) Scientific Computing with Multicore and Accelerators. Chapman and Hall/CRC, UK (2010)
Wainwright, I.: Optimized LU-decomposition with full pivot for small batched matrices, GTC 2013 - ID S3069, April 2013
Yamazaki, I., Tomov, S., Dongarra, J.: One-sided dense matrix factorizations on a multicore with multiple GPU accelerators. In: Proceedings of the International Conference on Computational Science, ICCS 2012. Procedia Computer Science 9, 37–46 (2012)
Yeralan, S.N., Davis, T.A., Ranka, S.: Sparse multifrontal QR on the GPU. Technical report, University of Florida (2013)
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. ACI-1339822, the Department of Energy, and Intel. The results were obtained in part with the financial support of the Russian Scientific Fund, Agreement N14-11-00190.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Haidar, A., Dong, T.T., Tomov, S., Luszczek, P., Dongarra, J. (2015). A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science(), vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_3
DOI: https://doi.org/10.1007/978-3-319-20119-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20118-4
Online ISBN: 978-3-319-20119-1