Abstract
As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach represents the algorithms as a sequence of batched BLAS routines for GPU-only execution. This is in contrast to hybrid CPU-GPU algorithms, which rely heavily on the multicore CPU for specific parts of the workload. For a system to benefit fully from the GPU's significantly higher energy efficiency, however, avoiding the multicore CPU must be a primary design goal, so that the system can rely more heavily on the more efficient GPU; doing so also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single streaming multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5× speedup on the K40 GPU.
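The batched mode of operation the abstract describes, i.e. applying the same Householder QR steps to many small, independent matrices, can be sketched in plain NumPy. This is only an illustrative CPU reference, not the paper's GPU implementation; the function name `batched_qr` and its interface are our own. The inner batch loop is the dimension that a GPU kernel would execute in parallel, one small problem per thread block.

```python
import numpy as np

def batched_qr(batch):
    """Householder QR applied to every matrix in a batch of small problems.

    batch : array of shape (batch_size, m, n) with m >= n.
    Returns Q of shape (batch_size, m, m) and R of shape (batch_size, m, n)
    such that batch[i] == Q[i] @ R[i] and each R[i] is upper triangular.
    """
    bs, m, n = batch.shape
    Q = np.broadcast_to(np.eye(m), (bs, m, m)).copy()
    R = np.array(batch, dtype=float)
    for k in range(n):           # the same column step for every problem ...
        for i in range(bs):      # ... while the batch dimension is embarrassingly parallel
            x = R[i, k:, k]
            alpha = np.linalg.norm(x)
            if alpha == 0.0:
                continue
            v = x.copy()
            # choose the sign that avoids cancellation in v[0]
            v[0] += alpha if x[0] >= 0 else -alpha
            v /= np.linalg.norm(v)
            # apply the reflector H = I - 2 v v^T to the trailing submatrix and to Q
            R[i, k:, k:] -= 2.0 * np.outer(v, v @ R[i, k:, k:])
            Q[i, :, k:] -= 2.0 * np.outer(Q[i, :, k:] @ v, v)
    return Q, R
```

For reference, cuBLAS exposes an analogous batched routine, `cublas<t>geqrfBatched`, which is the implementation the abstract's 5× comparison is made against.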
Notes
1. Historically, similar issues were associated with strong scaling [14] and were attributed to a fixed problem size.
References
Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., Tomov, S.: Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J. Phys.: Conf. Ser. 180(1), 012037 (2009)
Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S., Tomov, S.: Faster, cheaper, better - a hybridization methodology to develop linear algebra software for GPUs. In: Hwu, W.W. (ed.) GPU Computing Gems. Morgan Kaufmann, California (2010)
Agullo, E., Dongarra, J., Nath, R., Tomov, S.: Fully empirical autotuned QR factorization for multicore architectures. CoRR, abs/1102.5328 (2011)
ACML - AMD Core Math Library (2014). http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml
Anderson, M.J., Sheffield, D., Keutzer, K.: A predictive model for solving small linear algebra problems in GPU registers. In: IEEE 26th International Parallel Distributed Processing Symposium (IPDPS) (2012)
Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P., Tomov, S.: The impact of multicore on math software. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 1–10. Springer, Heidelberg (2007)
Cao, C., Dongarra, J., Du, P., Gates, M., Luszczek, P., Tomov, S.: clMAGMA: high performance dense linear algebra with OpenCL. In: The ACM International Conference Series, Atlanta, May 13–14 (2013). (submitted)
Dong, T., Haidar, A., Luszczek, P., Harris, A., Tomov, S., Dongarra, J.: LU factorization of small matrices: accelerating batched DGETRF on the GPU. In: Proceedings of 16th IEEE International Conference on High Performance and Communications (HPCC 2014), August 2014
Dong, T., Haidar, A., Tomov, S., Dongarra, J.: A fast batched Cholesky factorization on a GPU. In: Proceedings of 2014 International Conference on Parallel Processing (ICPP-2014), September 2014
Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S., Dongarra, J.: A step towards energy efficient computing: redesigning a hydrodynamic application on CPU-GPU. In: IEEE 28th International Parallel Distributed Processing Symposium (IPDPS) (2014)
Dongarra, J., Haidar, A., Kurzak, J., Luszczek, P., Tomov, S., YarKhan, A.: Model-driven one-sided factorizations on multicore accelerated systems. Supercomputing Frontiers and Innovations 1(1), 85 (2014)
Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J.: From CUDA to OpenCL: towards a performance-portable solution for multi-platform GPU programming. Parallel Comput. 38(8), 391–407 (2012)
Oak Ridge Leadership Computing Facility. Annual report 2013–2014 (2014). https://www.olcf.ornl.gov/wp-content/uploads/2015/01/AR_2014_Small.pdf
Gustafson, J.L.: Reevaluating Amdahl’s law. Commun. ACM 31(5), 532–533 (1988)
Haidar, A., Tomov, S., Dongarra, J., Solca, R., Schulthess, T.: A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks. Int. J. High Perform. Comput. Appl. 28(2), 196–209 (2012)
Haidar, A., Cao, C., Yarkhan, A., Luszczek, P., Tomov, S., Kabir, K., Dongarra, J.: Unified development for mixed multi-GPU and multi-coprocessor environments using a lightweight runtime environment. In: IPDPS 2014 Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 491–500. IEEE Computer Society, Washington (2014)
Haidar, A., Dong, T., Luszczek, P., Tomov, S., Dongarra, J.: Batched matrix computations on hardware accelerators based on GPUs. Int. J. High Perform. Comput. Appl. 18(1), 135–158 (2015). doi:10.1177/1094342014567546
Haidar, A., Luszczek, P., Tomov, S., Dongarra, J.: Optimization for performance and energy for batched matrix computations on GPUs. In: PPoPP 2015 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015, ACM, San Francisco, February 2015
Haidar, A., Luszczek, P., Tomov, S., Dongarra, J.: Towards batched linear solvers on accelerated hardware platforms. In: PPoPP 2015 Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, San Francisco, February 2015
Im, E.-J., Yelick, K., Vuduc, R.: Sparsity: optimization framework for sparse matrix kernels. Int. J. High Perform. Comput. Appl. 18(1), 135–158 (2004)
Matrix algebra on GPU and multicore architectures (MAGMA), MAGMA Release 1.6.1 (2015). http://icl.cs.utk.edu/magma/
Intel Pentium III Processor - Small Matrix Library (1999). http://www.intel.com/design/pentiumiii/sml/
Intel Math Kernel Library (2014). http://software.intel.com/intel-mkl/
Intel 64 and IA-32 architectures software developer’s manual, July 20 (2014). http://download.intel.com/products/processor/manual/
Keyes, D., Taylor, V.: NSF-ACCI task force on software for science and engineering, December 2010
Liao, J.C., Khodayari, A., Zomorrodi, A.R., Maranas, C.D.: A kinetic model of Escherichia coli core metabolism satisfying multiple sets of mutant flux data. Metab. Eng. 25C, 50–62 (2014)
Li, Y., Dongarra, J., Tomov, S.: A note on auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009, Part I. LNCS, vol. 5544, pp. 884–892. Springer, Heidelberg (2009)
Messer, O.E.B., Harris, J.A., Parete-Koon, S., Chertkow, M.A.: Multicore and accelerator development for a leadership-class stellar astrophysics code. In: Manninen, P., Öster, P. (eds.) PARA. LNCS, vol. 7782, pp. 92–106. Springer, Heidelberg (2013)
Molero, J.M., Garzón, E.M., García, I., Quintana-Ortí, E.S., Plaza, A.: Poster: a batched Cholesky solver for local RX anomaly detection on GPUs. In: PUMPS (2013)
Nath, R., Tomov, S., Dong, T., Dongarra, J.: Optimizing symmetric dense matrix-vector multiplication on GPUs. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2011
Nath, R., Tomov, S., Dongarra, J.: Accelerating GPU kernels for dense linear algebra. In: VECPAR 2010 Proceedings of the 2009 International Meeting on High Performance Computing for Computational Science, pp. 22–25. Springer, Berkeley, June 2010
Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for Fermi graphics processing units. Int. J. High Perform. Comput. Appl. 24(4), 511–515 (2010)
NVIDIA Visual Profiler (2014)
NVIDIA Management Library (NVML) (2014). https://developer.nvidia.com/nvidia-management-library-nvml
CUBLAS (2014). http://docs.nvidia.com/cuda/cublas/
CUBLAS 6.5, January 2015. http://docs.nvidia.com/cuda/cublas/
Villa, O., Fatica, M., Gawande, N., Tumeo, A.: Power/performance trade-offs of small batched LU based solvers on GPUs. In: Wolf, F., Mohr, B., an Mey, D. (eds.) Euro-Par 2013. LNCS, vol. 8097, pp. 813–825. Springer, Heidelberg (2013)
Villa, O., Gawande, N., Tumeo, A.: Accelerating subsurface transport simulation on heterogeneous clusters. In: IEEE International Conference on Cluster Computing (CLUSTER 2013), pp. 23–27, Indiana, September 2013
Rotem, E., Naveh, A., Rajwan, D., Ananthakrishnan, A., Weissmann, E.: Power-management architecture of the Intel microarchitecture code-named Sandy Bridge. IEEE Micro 32(2), 20–27 (2012). doi:10.1109/MM.2012.12. ISSN: 0272-1732
Tomov, S., Dongarra, J., Baboulin, M.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Comput. Syst. Appl. 36(5–6), 232–240 (2010). doi:10.1016/j.parco.2009.12.005
Tomov, S., Nath, R., Ltaief, H., Dongarra, J.: Dense linear algebra solvers for multicore with GPU accelerators. In: Proceedings of the IEEE IPDPS 2010, pp. 1–8. IEEE Computer Society, Atlanta, 19–23 April 2010. doi:10.1109/IPDPSW.2010.5470941
Tomov, S., Dongarra, J.: Dense linear algebra for hybrid GPU-based systems. In: Kurzak, J., Bader, D.A., Dongarra, J. (eds.) Scientific Computing with Multicore and Accelerators. Chapman and Hall/CRC, UK (2010)
Wainwright, I.: Optimized LU-decomposition with full pivot for small batched matrices, GTC 2013 - ID S3069, April 2013
Yamazaki, I., Tomov, S., Dongarra, J.: One-sided dense matrix factorizations on a multicore with multiple GPU accelerators. In: Proceedings of the International Conference on Computational Science, ICCS 2012. Procedia Computer Science 9, 37–46 (2012)
Yeralan, S.N., Davis, T.A., Ranka, S.: Sparse multifrontal QR on the GPU. Technical report, University of Florida (2013)
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. ACI-1339822, the Department of Energy, and Intel. The results were obtained in part with the financial support of the Russian Scientific Fund, Agreement N14-11-00190.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Haidar, A., Dong, T.T., Tomov, S., Luszczek, P., Dongarra, J. (2015). A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science(), vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_3
DOI: https://doi.org/10.1007/978-3-319-20119-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20118-4
Online ISBN: 978-3-319-20119-1