A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization

Chen, Yong; Jin, Hai; Zheng, Ran; Liu, Yuandong; Wang, Wei

doi:10.1007/s11265-017-1227-9

Published: 24 February 2017

Volume 90, pages 53–67, (2018)

Journal of Signal Processing Systems

Abstract

In many scientific computing applications, sparse Cholesky factorization is used to solve large sparse linear systems in distributed environments. GPU computing offers a new way to solve the problem. However, sparse Cholesky factorization on a GPU rarely achieves good performance because of the irregular structure of the matrix and low GPU resource utilization. This paper proposes a hybrid CPU-GPU implementation of sparse Cholesky factorization based on the multifrontal method. The method decomposes a large sparse coefficient matrix into a series of small dense matrices (frontal matrices) and then performs multiple GEMM (General Matrix-matrix Multiplication) operations on them. GEMMs are the main operations in sparse Cholesky factorization, but they are hard to execute efficiently in parallel on a GPU. To improve performance, a scheme of multiple task queues is adopted so that multiple GEMMs run in parallel within the multifrontal method; all GEMM tasks are scheduled dynamically on the GPU and CPU according to their computation scale, to balance load and reduce computing time. Experimental results show that the approach can outperform the cuBLAS implementation, achieving up to 1.98× speedup on a GTX460 (Fermi micro-architecture) and up to 3.06× speedup on a K20m (Kepler micro-architecture).
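
The idea of assigning GEMM tasks to CPU and GPU queues by computation scale can be illustrated with a minimal sketch. It is not the paper's scheduler: the names `GemmTask` and `schedule`, the throughput figures, and the fixed GPU launch overhead are all hypothetical assumptions chosen for illustration. The sketch estimates each task's cost as roughly 2·m·n·k floating-point operations and greedily places it on the device whose queue would finish it earliest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmTask:
    """One dense product of an (m x k) matrix with a (k x n) matrix (hypothetical type)."""
    m: int
    n: int
    k: int

    @property
    def flops(self) -> int:
        # A dense GEMM costs about 2*m*n*k floating-point operations.
        return 2 * self.m * self.n * self.k


def schedule(tasks, cpu_gflops=50.0, gpu_gflops=500.0, gpu_overhead_s=1e-4):
    """Greedily assign tasks to a CPU queue or a GPU queue.

    The throughput values and the per-task GPU overhead are assumed
    placeholders, not measurements from the paper.
    """
    queues = {"cpu": [], "gpu": []}
    busy_until = {"cpu": 0.0, "gpu": 0.0}
    # Place the largest tasks first so that small ones can fill the gaps.
    for task in sorted(tasks, key=lambda t: t.flops, reverse=True):
        cost = {
            "cpu": task.flops / (cpu_gflops * 1e9),
            "gpu": gpu_overhead_s + task.flops / (gpu_gflops * 1e9),
        }
        # Pick the device on which this task would complete earliest.
        device = min(cost, key=lambda d: busy_until[d] + cost[d])
        queues[device].append(task)
        busy_until[device] += cost[device]
    return queues, busy_until


if __name__ == "__main__":
    big = GemmTask(2000, 2000, 2000)
    small = GemmTask(8, 8, 8)
    queues, busy_until = schedule([small, big])
    print("GPU queue:", queues["gpu"])
    print("CPU queue:", queues["cpu"])
```

Under these assumed rates, the large product goes to the GPU queue, while the tiny one stays on the CPU, where the fixed launch overhead would otherwise dominate its running time. A real implementation would measure these costs rather than assume them.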

Acknowledgments

This work is supported by the National Natural Science Foundation of China (grant No. 61133008) and the National Basic Research Program (973 Program) (grant No. 2013CB2282036).

Author information

Authors and Affiliations

Services Computing Technology and System Lab, Big Data Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Yong Chen, Hai Jin, Ran Zheng, Yuandong Liu & Wei Wang

Corresponding author

Correspondence to Ran Zheng.

About this article

Cite this article

Chen, Y., Jin, H., Zheng, R. et al. A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization. J Sign Process Syst 90, 53–67 (2018). https://doi.org/10.1007/s11265-017-1227-9

Received: 04 May 2016
Revised: 17 January 2017
Accepted: 25 January 2017
Published: 24 February 2017
Issue Date: January 2018
DOI: https://doi.org/10.1007/s11265-017-1227-9
