Abstract
We present a Cholesky factorization for multicore systems with GPU accelerators. The challenges in developing scalable high-performance algorithms for these emerging systems stem from their heterogeneity, their massive parallelism, and the huge gap between the GPUs' compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. This results in a scalable hybrid Cholesky factorization of unprecedented performance. In particular, using NVIDIA's Tesla S1070 (4 C1060 GPUs, each with 30 cores @1.44 GHz) connected to two dual-core AMD Opteron @1.8 GHz processors, we reach up to 1.163 TFlop/s in single and up to 275 GFlop/s in double precision arithmetic. Compared with the performance of the embarrassingly parallel xGEMM over four GPUs, where no communication between GPUs is involved, our algorithm still runs at 73% and 84% of that rate for single and double precision arithmetic, respectively.
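As a rough reading aid, the last two figures can be combined to estimate the four-GPU xGEMM rates they imply. The sketch below is only back-of-the-envelope arithmetic on the rounded numbers quoted above. The function name `implied_gemm_rate` is hypothetical, and the results are approximations, not measurements reported by the authors.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# Assumption: the quoted percentages are (Cholesky rate) / (xGEMM rate),
# so the xGEMM rate is recovered by division. Inputs are rounded, so the
# outputs are only approximate.

def implied_gemm_rate(cholesky_gflops: float, fraction_of_gemm: float) -> float:
    """Return the xGEMM rate (GFlop/s) implied by a Cholesky rate and its fraction of xGEMM."""
    return cholesky_gflops / fraction_of_gemm

# Single precision: 1.163 TFlop/s at 73% of SGEMM over four GPUs.
single = implied_gemm_rate(1163.0, 0.73)
# Double precision: 275 GFlop/s at 84% of DGEMM over four GPUs.
double = implied_gemm_rate(275.0, 0.84)

print(f"implied SGEMM over four GPUs: about {single:.0f} GFlop/s")
print(f"implied DGEMM over four GPUs: about {double:.0f} GFlop/s")
```

Under these assumptions the implied xGEMM rates come out at roughly 1.6 TFlop/s in single precision and a little over 300 GFlop/s in double precision.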
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ltaief, H., Tomov, S., Nath, R., Du, P., Dongarra, J. (2011). A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds) High Performance Computing for Computational Science – VECPAR 2010. VECPAR 2010. Lecture Notes in Computer Science, vol 6449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19328-6_11
DOI: https://doi.org/10.1007/978-3-642-19328-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19327-9
Online ISBN: 978-3-642-19328-6
eBook Packages: Computer Science (R0)