
A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

  • Conference paper
  • In: High Performance Computing (ISC High Performance 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9137)

Abstract

As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach represents the algorithms as a sequence of batched BLAS routines for GPU-only execution, in contrast to hybrid CPU-GPU algorithms that rely heavily on the multicore CPU for specific parts of the workload. For a system to benefit fully from the GPU's significantly higher energy efficiency, however, avoiding the multicore CPU must be a primary design goal; doing so also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis, together with profiling and tracing tools, guided the development and optimization of our batched factorization to achieve up to a 2-fold speedup and a 3-fold energy-efficiency improvement compared to our highly optimized batched CPU implementations based on the MKL library (using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieve up to a 5× speedup on the K40 GPU.
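The batched mode of operation described in the abstract — applying the same small Householder-QR factorization independently to every matrix in a batch — can be illustrated with a plain NumPy sketch. This is a hypothetical CPU illustration only, not the paper's implementation: the paper fuses this per-matrix work into batched BLAS-like kernels that run entirely on the GPU.

```python
import numpy as np

def batched_qr(batch):
    """Sketch of a batched Householder QR: the same small factorization
    is applied independently to each matrix in the batch. (Hypothetical
    illustration; the paper implements this pattern as fused batched
    BLAS-like kernels executing GPU-only.)"""
    qs, rs = [], []
    for a in batch:                        # one small, independent problem each
        m, n = a.shape
        q = np.eye(m)
        r = a.astype(float).copy()
        for k in range(n):
            x = r[k:, k]
            v = x.copy()
            # Householder vector chosen to annihilate x below its first entry
            # (copysign avoids cancellation in v[0])
            v[0] += np.copysign(np.linalg.norm(x), x[0])
            nv = np.linalg.norm(v)
            if nv > 0:
                v /= nv
                h = np.eye(m - k) - 2.0 * np.outer(v, v)  # reflector H_k
                r[k:, k:] = h @ r[k:, k:]                 # R := H_k R
                q[:, k:] = q[:, k:] @ h                   # Q := Q H_k
        qs.append(q)
        rs.append(r)
    return np.array(qs), np.array(rs)
```

The loop over `batch` is exactly the parallelism the paper exploits: each iteration is independent, so all factorizations can run concurrently across the GPU's multiprocessors instead of sequentially.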

Notes

  1.

    Historically, similar issues were associated with strong scaling [14] and were attributed to a fixed problem size.

References

  1. Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., Tomov, S.: Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects. J. Phys.: Conf. Ser. 180(1), 012037 (2009)


  2. Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S., Tomov, S.: Faster, cheaper, better - a hybridization methodology to develop linear algebra software for GPUs. In: Hwu, W.W. (ed.) GPU Computing Gems. Morgan Kaufmann, California (2010)

  3. Agullo, E., Dongarra, J., Nath, R., Tomov, S.: Fully empirical autotuned QR factorization for multicore architectures. CoRR abs/1102.5328 (2011)

  4. ACML - AMD Core Math Library (2014). http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml

  5. Anderson, M.J., Sheffield, D., Keutzer, K.: A predictive model for solving small linear algebra problems in GPU registers. In: IEEE 26th International Parallel Distributed Processing Symposium (IPDPS) (2012)

  6. Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P., Tomov, S.: The impact of multicore on math software. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 1–10. Springer, Heidelberg (2007)


  7. Cao, C., Dongarra, J., Du, P., Gates, M., Luszczek, P., Tomov, S.: clMAGMA: high performance dense linear algebra with OpenCL. In: The ACM International Conference Series, Atlanta, May 13–14 (2013). (submitted)


  8. Dong, T., Haidar, A., Luszczek, P., Harris, A., Tomov, S., Dongarra, J.: LU factorization of small matrices: accelerating batched DGETRF on the GPU. In: Proceedings of 16th IEEE International Conference on High Performance and Communications (HPCC 2014), August 2014


  9. Dong, T., Haidar, A., Tomov, S., Dongarra, J.: A fast batched Cholesky factorization on a GPU. In: Proceedings of the 2014 International Conference on Parallel Processing (ICPP-2014), September 2014

  10. Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S., Dongarra, J.: A step towards energy efficient computing: redesigning a hydrodynamic application on CPU-GPU. In: IEEE 28th International Parallel Distributed Processing Symposium (IPDPS) (2014)


  11. Dongarra, J., Haidar, A., Kurzak, J., Luszczek, P., Tomov, S., YarKhan, A.: Model-driven one-sided factorizations on multicore accelerated systems. Supercomputing Frontiers and Innovations 1(1), 85 (2014)

  12. Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J.: From CUDA to OpenCL: towards a performance-portable solution for multi-platform GPU programming. Parallel Comput. 38(8), 391–407 (2012)

  13. Oak Ridge Leadership Computing Facility. Annual report 2013–2014 (2014). https://www.olcf.ornl.gov/wp-content/uploads/2015/01/AR_2014_Small.pdf

  14. Gustafson, J.L.: Reevaluating Amdahl’s law. Commun. ACM 31(5), 532–533 (1988)


  15. Haidar, A., Tomov, S., Dongarra, J., Solca, R., Schulthess, T.: A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks. Int. J. High Perform. Comput. Appl. 28(2), 196–209 (2012)


  16. Haidar, A., Cao, C., YarKhan, A., Luszczek, P., Tomov, S., Kabir, K., Dongarra, J.: Unified development for mixed multi-GPU and multi-coprocessor environments using a lightweight runtime environment. In: Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS 2014), pp. 491–500. IEEE Computer Society, Washington (2014)

  17. Haidar, A., Dong, T., Luszczek, P., Tomov, S., Dongarra, J.: Batched matrix computations on hardware accelerators based on GPUs. Int. J. High Perform. Comput. Appl. (2015). doi:10.1177/1094342014567546

  18. Haidar, A., Luszczek, P., Tomov, S., Dongarra, J.: Optimization for performance and energy for batched matrix computations on GPUs. In: 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8), co-located with PPoPP 2015, ACM, San Francisco, February 2015

  19. Haidar, A., Luszczek, P., Tomov, S., Dongarra, J.: Towards batched linear solvers on accelerated hardware platforms. In: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015), ACM, San Francisco, February 2015

  20. Im, E.-J., Yelick, K., Vuduc, R.: Sparsity: optimization framework for sparse matrix kernels. Int. J. High Perform. Comput. Appl. 18(1), 135–158 (2004)


  21. Matrix algebra on GPU and multicore architectures (MAGMA), MAGMA Release 1.6.1 (2015). http://icl.cs.utk.edu/magma/

  22. Intel Pentium III Processor - Small Matrix Library (1999). http://www.intel.com/design/pentiumiii/sml/

  23. Intel Math Kernel Library (2014). http://software.intel.com/intel-mkl/

  24. Intel 64 and IA-32 architectures software developer’s manual, July 20 (2014). http://download.intel.com/products/processor/manual/

  25. Keyes, D., Taylor, V.: NSF-ACCI task force on software for science and engineering, December 2010


  26. Liao, J.C., Khodayari, A., Zomorrodi, A.R., Maranas, C.D.: A kinetic model of Escherichia coli core metabolism satisfying multiple sets of mutant flux data. Metab. Eng. 25C, 50–62 (2014)

  27. Li, Y., Dongarra, J., Tomov, S.: A note on auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009, Part I. LNCS, vol. 5544, pp. 884–892. Springer, Heidelberg (2009)


  28. Messer, O.E.B., Harris, J.A., Parete-Koon, S., Chertkow, M.A.: Multicore and accelerator development for a leadership-class stellar astrophysics code. In: Manninen, P., Öster, P. (eds.) PARA. LNCS, vol. 7782, pp. 92–106. Springer, Heidelberg (2013)


  29. Molero, J.M., Garzón, E.M., García, I., Quintana-Ortí, E.S., Plaza, A.: Poster: a batched Cholesky solver for local RX anomaly detection on GPUs. In: PUMPS (2013)

  30. Nath, R., Tomov, S., Dong, T., Dongarra, J.: Optimizing symmetric dense matrix-vector multiplication on GPUs. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2011

  31. Nath, R., Tomov, S., Dongarra, J.: Accelerating GPU kernels for dense linear algebra. In: VECPAR 2010: Proceedings of the 2009 International Meeting on High Performance Computing for Computational Science, pp. 22–25. Springer, Berkeley, June 2010

  32. Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for Fermi graphics processing units. Int. J. High Perform. Comput. Appl. 24(4), 511–515 (2010)

  33. NVIDIA Visual Profiler. https://developer.nvidia.com/nvidia-visual-profiler

  34. NVIDIA Management Library (NVML) (2014). https://developer.nvidia.com/nvidia-management-library-nvml

  35. CUBLAS (2014). http://docs.nvidia.com/cuda/cublas/

  36. CUBLAS 6.5, January 2015. http://docs.nvidia.com/cuda/cublas/

  37. Villa, O., Fatica, M., Gawande, N., Tumeo, A.: Power/performance trade-offs of small batched LU based solvers on GPUs. In: Wolf, F., Mohr, B., an Mey, D. (eds.) Euro-Par 2013. LNCS, vol. 8097, pp. 813–825. Springer, Heidelberg (2013)


  38. Villa, O., Gawande, N., Tumeo, A.: Accelerating subsurface transport simulation on heterogeneous clusters. In: IEEE International Conference on Cluster Computing (CLUSTER 2013), Indiana, September 23–27, 2013

  39. Rotem, E., Naveh, A., Rajwan, D., Ananthakrishnan, A., Weissmann, E.: Power-management architecture of the Intel microarchitecture code-named Sandy Bridge. IEEE Micro 32(2), 20–27 (2012). doi:10.1109/MM.2012.12

  40. Tomov, S., Dongarra, J., Baboulin, M.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Comput. 36(5–6), 232–240 (2010). doi:10.1016/j.parco.2009.12.005

  41. Tomov, S., Nath, R., Ltaief, H., Dongarra, J.: Dense linear algebra solvers for multicore with GPU accelerators. In: Proceedings of the IEEE IPDPS 2010, pp. 1–8. IEEE Computer Society, Atlanta, 19–23 April 2010. doi:10.1109/IPDPSW.2010.5470941

  42. Tomov, S., Dongarra, J.: Dense linear algebra for hybrid GPU-based systems. In: Kurzak, J., Bader, D.A., Dongarra, J. (eds.) Scientific Computing with Multicore and Accelerators. Chapman and Hall/CRC, UK (2010)

  43. Wainwright, I.: Optimized LU-decomposition with full pivot for small batched matrices. GTC 2013, ID S3069, April 2013

  44. Yamazaki, I., Tomov, S., Dongarra, J.: One-sided dense matrix factorizations on a multicore with multiple GPU accelerators. In: Proceedings of the International Conference on Computational Science (ICCS 2012), Procedia Computer Science 9, 37–46 (2012)

  45. Yeralan, S.N., Davis, T.A., Ranka, S.: Sparse multifrontal QR on the GPU. Technical report, University of Florida (2013)


Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. ACI-1339822, the Department of Energy, and Intel. The results were obtained in part with the financial support of the Russian Scientific Fund, Agreement N14-11-00190.

Author information

Correspondence to Azzam Haidar.


Copyright information

© 2015 Springer International Publishing Switzerland

Cite this paper

Haidar, A., Dong, T.T., Tomov, S., Luszczek, P., Dongarra, J. (2015). A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations. In: Kunkel, J., Ludwig, T. (eds.) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science, vol. 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_3


  • DOI: https://doi.org/10.1007/978-3-319-20119-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20118-4

  • Online ISBN: 978-3-319-20119-1

  • eBook Packages: Computer Science (R0)
