Efficiently solving tri-diagonal system by chunked cyclic reduction and single-GPU shared memory

Zhao, Di; Yu, Jinhang

doi:10.1007/s11227-014-1299-2

Published: 20 September 2014

Volume 71, pages 369–390, (2015)

The Journal of Supercomputing

Di Zhao¹,² & Jinhang Yu³

Abstract

Tri-diagonal systems arise in dynamic problems such as fluid simulation, and solving them efficiently is important to the success of these applications. In this paper, we develop a chunked cyclic reduction solver that runs entirely in GPU shared memory, subject to the capacity constraint of shared memory. Computational results show that the solver achieves high efficiency on an Nvidia TITAN with 48 KB of shared memory: it solves a tri-diagonal system with a 262,144-by-262,144 coefficient matrix in 1.768 ms. The results also show that the solver scales well with the size of the coefficient matrix and the size of the reduced systems. Because it is built entirely on GPU shared memory, our solver may be faster than existing GPU solvers, owing to the efficiency of shared memory; however, its solubility is smaller than that of existing GPU solvers, owing to the capacity constraint of shared memory. Here, solubility means the largest coefficient matrix size of a tri-diagonal system that our solver can solve.
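
For readers unfamiliar with cyclic reduction, the following is a minimal serial sketch of the classic (odd-even) algorithm that chunked GPU variants build on. It is not the authors' shared-memory implementation, and it does no chunking. The function names `cyclic_reduction` and `tridiag_matvec` are hypothetical, and the sketch assumes a system that is safe to solve without pivoting (for example, a diagonally dominant one) with `a[0] == 0` and `c[-1] == 0`. On a GPU, each of the two inner loops would typically run in parallel, one equation per thread.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    a, b, c are the sub-, main and super-diagonals, d the right-hand side.
    Assumed convention: a[0] == 0 and c[-1] == 0. No pivoting is done.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Forward reduction: use equations i-1 and i+1 to eliminate x[i-1] and
    # x[i+1] from every odd-indexed equation i. The result is a tri-diagonal
    # system of about half the size in the odd-indexed unknowns only.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        new_a = -alpha * a[i - 1]
        new_b = b[i] - alpha * c[i - 1]
        new_c = 0.0
        new_d = d[i] - alpha * d[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            new_b -= gamma * a[i + 1]
            new_c = -gamma * c[i + 1]
            new_d -= gamma * d[i + 1]
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rd.append(new_d)

    # Solve the reduced system (recursion stands in for the reduction levels).
    x_odd = cyclic_reduction(ra, rb, rc, rd)

    # Back substitution: recover the even-indexed unknowns from their
    # already-known odd-indexed neighbours.
    x = [0.0] * n
    for k, i in enumerate(range(1, n, 2)):
        x[i] = x_odd[k]
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


def tridiag_matvec(a, b, c, x):
    """Multiply the tri-diagonal matrix (a, b, c) by the vector x."""
    n = len(b)
    return [
        (a[i] * x[i - 1] if i > 0 else 0.0)
        + b[i] * x[i]
        + (c[i] * x[i + 1] if i + 1 < n else 0.0)
        for i in range(n)
    ]


# Usage: build a small diagonally dominant system with a known solution.
n = 8
a = [0.0] + [1.0] * (n - 1)
b = [4.0] * n
c = [1.0] * (n - 1) + [0.0]
x_true = [float(i + 1) for i in range(n)]
d = tridiag_matvec(a, b, c, x_true)
x = cyclic_reduction(a, b, c, d)
```

Each reduction level roughly halves the number of unknowns, so a system of size n needs about log2(n) levels. The working set of the early levels is the full set of diagonals and the right-hand side, which is why a solver that keeps everything in shared memory is bounded by that memory's capacity.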

Acknowledgments

We thank Exxact Corporation for providing usage time on a Tesla K20 GPU through Nvidia’s GPU Test Drive program. We thank the reviewers for their valuable suggestions.

Author information

Authors and Affiliations

Center for Cognitive and Brain Science, The Ohio State University, Columbus, OH, 43210, USA
Di Zhao
College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
Di Zhao
Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, Columbus, OH, 43210, USA
Jinhang Yu

Corresponding author

Correspondence to Di Zhao.

About this article

Cite this article

Zhao, D., Yu, J. Efficiently solving tri-diagonal system by chunked cyclic reduction and single-GPU shared memory. J Supercomput 71, 369–390 (2015). https://doi.org/10.1007/s11227-014-1299-2

Issue Date: February 2015