A parallel solving method for block-tridiagonal equations on CPU–GPU heterogeneous computing systems

Yang, Wangdong; Li, Kenli; Li, Keqin

doi:10.1007/s11227-016-1881-x

Published: 22 September 2016

Volume 73, pages 1760–1781 (2017)

The Journal of Supercomputing

Wangdong Yang¹,
Kenli Li¹ &
Keqin Li¹,²

Abstract

Solving block-tridiagonal systems is a key problem in the numerical simulation of many scientific and engineering applications. In most block-tridiagonal matrices, the non-zero elements are concentrated in the blocks on the main diagonal, while the blocks above and below the main diagonal contain few non-zero elements. We therefore present a solving method that combines direct and iterative methods. In our method, the submatrices on the main diagonal are solved by direct methods within each iteration. Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence of the iteration for the block-tridiagonal system of linear equations is accelerated. Some direct methods perform well on small-scale equations, and the sub-equations can be solved in parallel. We present an improved algorithm that solves the sub-equations with thread blocks on the GPU and stores the intermediate data in shared memory, which significantly reduces memory access latency. Furthermore, we analyze a cloud resource scheduling model and obtain ten block-tridiagonal matrices produced by simulating the cloud-computing system. Our method improves the computing performance of solving these block-tridiagonal systems of linear equations.
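The combination of an outer iteration with direct solves on the diagonal blocks can be illustrated with a small CPU-only sketch. The abstract does not state which splitting the authors use, so the block Jacobi form below is an assumption, as are all names (`solve_dense`, `matvec`, `block_jacobi`), the tolerance, and the toy matrices. Each diagonal block is solved directly by Gaussian elimination, and the sparse off-diagonal blocks enter only through the right-hand side. The per-block solves inside one sweep are independent of each other, which is the property that would allow them to be assigned to separate GPU thread blocks.

```python
# Sketch only: block Jacobi iteration for a block-tridiagonal system.
# The splitting, the function names, and the tolerances are assumptions
# made for illustration; they are not taken from the paper.

def solve_dense(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs stay untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular diagonal block")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def matvec(a, v):
    """Dense matrix-vector product."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def block_jacobi(lower, diag, upper, rhs, tol=1e-10, max_iter=500):
    """Iterate x_i <- D_i^{-1} (b_i - L_i x_{i-1} - U_i x_{i+1}) over all blocks.

    lower[i] couples block i to block i-1 (lower[0] is unused),
    upper[i] couples block i to block i+1 (upper[-1] is unused).
    Returns the block solution and the number of sweeps performed.
    """
    nb = len(diag)
    x = [[0.0] * len(r) for r in rhs]
    for sweep in range(1, max_iter + 1):
        new_x = []
        for i in range(nb):  # independent per block within one sweep
            r = list(rhs[i])
            if i > 0:
                r = [ri - ci for ri, ci in zip(r, matvec(lower[i], x[i - 1]))]
            if i < nb - 1:
                r = [ri - ci for ri, ci in zip(r, matvec(upper[i], x[i + 1]))]
            new_x.append(solve_dense(diag[i], r))  # direct solve of the diagonal block
        change = max(abs(p - q) for xb, yb in zip(x, new_x) for p, q in zip(xb, yb))
        x = new_x
        if change < tol:
            return x, sweep
    return x, max_iter


# Toy system: three 2x2 blocks, dense diagonal blocks, sparse couplings.
d = [[4.0, 1.0], [1.0, 4.0]]
o = [[0.1, 0.0], [0.0, 0.1]]
diag = [d, d, d]
lower = [None, o, o]
upper = [o, o, None]
rhs = [[5.1, 5.1], [5.2, 5.2], [5.1, 5.1]]  # chosen so every unknown equals 1
solution, sweeps = block_jacobi(lower, diag, upper, rhs)
```

In this toy case the off-diagonal blocks are small relative to the diagonal blocks, so the sweep converges quickly. For matrices where the couplings are stronger, a block Jacobi splitting may converge slowly or not at all, and the sketch makes no claim about how the paper handles that case.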

Acknowledgments

The authors deeply appreciate the anonymous reviewers for their comments on the manuscript. The research was partially funded by the Key Program of the National Natural Science Foundation of China (Grant Nos. 61133005 and 61432005), the National Natural Science Foundation of China (Grant Nos. 61370095, 61472124, and 61572175), and the Science and Technology Project of Hunan Province (Grant No. 2015SK20062).

Author information

Authors and Affiliations

College of Information Science and Engineering, Hunan University, Changsha, 410008, Hunan, China
Wangdong Yang, Kenli Li & Keqin Li
Department of Computer Science, State University of New York, New Paltz, NY, 12561, USA
Keqin Li

Corresponding author

Correspondence to Wangdong Yang.

About this article

Cite this article

Yang, W., Li, K. & Li, K. A parallel solving method for block-tridiagonal equations on CPU–GPU heterogeneous computing systems. J Supercomput 73, 1760–1781 (2017). https://doi.org/10.1007/s11227-016-1881-x

Published: 22 September 2016
Issue Date: May 2017
DOI: https://doi.org/10.1007/s11227-016-1881-x
