A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators

Ltaief, Hatem; Tomov, Stanimire; Nath, Rajib; Du, Peng; Dongarra, Jack

doi:10.1007/978-3-642-19328-6_11


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6449)

Included in the following conference series:

International Conference on High Performance Computing for Computational Science

Abstract

We present a Cholesky factorization for multicore systems with GPU accelerators. The challenges in developing scalable high-performance algorithms for these emerging systems stem from their heterogeneity, their massive parallelism, and the huge gap between the GPUs’ compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. This results in a scalable hybrid Cholesky factorization of unprecedented performance. In particular, using NVIDIA’s Tesla S1070 (four C1060 GPUs, each with 30 cores @ 1.44 GHz) connected to two dual-core AMD Opteron processors @ 1.8 GHz, we reach up to 1.163 TFlop/s in single and up to 275 GFlop/s in double precision arithmetic. Compared with the performance of the embarrassingly parallel xGEMM over four GPUs, where no communication between GPUs is involved, our algorithm still runs at 73% and 84% of that rate for single and double precision arithmetic, respectively.
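
To put the quoted percentages in context, the short Python sketch below backs out the four-GPU xGEMM rate that the abstract's figures imply. It also estimates a factorization time from the standard leading-order Cholesky flop count of n^3 / 3. The function names and the matrix size n are hypothetical and are not taken from the paper. The percentages in the abstract are rounded, so the implied xGEMM rates are approximate.

```python
# Figures quoted in the abstract: peak Cholesky rate (GFlop/s) and the
# fraction of the four-GPU xGEMM rate that it represents.
CHOLESKY_GFLOPS = {"single": 1163.0, "double": 275.0}
FRACTION_OF_GEMM = {"single": 0.73, "double": 0.84}


def implied_gemm_gflops(precision: str) -> float:
    """Back out the xGEMM rate implied by the quoted percentage."""
    return CHOLESKY_GFLOPS[precision] / FRACTION_OF_GEMM[precision]


def cholesky_seconds(n: int, precision: str) -> float:
    """Rough factorization time if the quoted peak rate were sustained.

    Uses the leading-order flop count n^3 / 3 for Cholesky. The matrix
    size n is an illustrative input, not a value reported in the paper.
    """
    flops = n ** 3 / 3.0
    return flops / (CHOLESKY_GFLOPS[precision] * 1e9)


if __name__ == "__main__":
    for precision in ("single", "double"):
        rate = implied_gemm_gflops(precision)
        print(f"{precision}: implied xGEMM rate ~ {rate:.0f} GFlop/s")
```

The time estimate is optimistic for small matrices, since the abstract reports these rates as upper bounds ("up to").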

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du & Jack Dongarra

Editor information

Editors and Affiliations

Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias s/n, 4200-465, Porto, Portugal
José M. Laginha M. Palma
INP (ENSEEIHT) IRIT, University of Toulouse, rue Charles-Camichel, CEDEX 7, 31071, Toulouse, France
Michel Daydé
Lawrence Berkeley National Laboratory, Berkeley, USA
Osni Marques
Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-465, Porto, Portugal
João Correia Lopes

Cite this paper

Ltaief, H., Tomov, S., Nath, R., Du, P., Dongarra, J. (2011). A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds) High Performance Computing for Computational Science – VECPAR 2010. VECPAR 2010. Lecture Notes in Computer Science, vol 6449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19328-6_11

DOI: https://doi.org/10.1007/978-3-642-19328-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19327-9
Online ISBN: 978-3-642-19328-6
eBook Packages: Computer Science, Computer Science (R0)
