A parallel solving method for block-tridiagonal equations on CPU–GPU heterogeneous computing systems

The Journal of Supercomputing

Abstract

Solving block-tridiagonal systems is a key step in the numerical simulation of many scientific and engineering problems. In most block-tridiagonal matrices, the non-zero elements are concentrated in the blocks on the main diagonal, while the blocks above and below the main diagonal contain few non-zero elements. We therefore present a solving method that combines direct and iterative methods. In our method, the sub-systems defined by the submatrices on the main diagonal are solved by direct methods within each iteration. Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence of the iteration for the block-tridiagonal system of linear equations is accelerated. Some direct methods perform well on small-scale equations, and the sub-equations can be solved in parallel. We present an improved algorithm that solves the sub-equations with thread blocks on the GPU and stores the intermediate data in shared memory, which significantly reduces memory access latency. Furthermore, we analyze a cloud resource scheduling model and obtain ten block-tridiagonal matrices produced by simulating the cloud-computing system. Our method improves the computing performance of solving these block-tridiagonal systems of linear equations.
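The mixed direct/iterative idea in the abstract can be illustrated with a block-Jacobi style iteration. This is only one plausible reading of the abstract and not necessarily the authors' scheme. In the sketch below, each diagonal block is solved directly by Gaussian elimination, while the sparse off-diagonal coupling is moved to the right-hand side using the previous iterate. The function names (`solve_dense`, `block_jacobi`), the tolerance, and the small test system are all assumptions made for illustration. The sketch runs serially on the CPU, but the per-block solves inside one sweep are independent of each other, which is the property that would allow each of them to be assigned to its own GPU thread block.

```python
def solve_dense(A, b):
    """Direct solve of a small dense system: Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def matvec(B, v):
    """Dense matrix-vector product."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in B]


def block_jacobi(diag, lower, upper, rhs, tol=1e-10, max_iter=200):
    """Iterate x_i <- D_i^{-1} (b_i - L_i x_{i-1} - U_i x_{i+1}).

    diag[i]  : diagonal block D_i
    lower[i] : block coupling block row i to i-1 (lower[0] is ignored)
    upper[i] : block coupling block row i to i+1 (upper[-1] is ignored)
    """
    nb = len(diag)
    x = [[0.0] * len(r) for r in rhs]
    for it in range(1, max_iter + 1):
        new_x = []
        for i in range(nb):  # each pass of this loop is independent of the others
            r = rhs[i][:]
            if i > 0:
                r = [a - c for a, c in zip(r, matvec(lower[i], x[i - 1]))]
            if i < nb - 1:
                r = [a - c for a, c in zip(r, matvec(upper[i], x[i + 1]))]
            new_x.append(solve_dense(diag[i], r))  # direct solve of the diagonal block
        change = max(abs(a - c) for xi, yi in zip(new_x, x) for a, c in zip(xi, yi))
        x = new_x
        if change < tol:
            return x, it
    return x, max_iter


# Hypothetical test system: dense diagonal blocks, nearly empty off-diagonal blocks.
D = [[4.0, 1.0], [1.0, 4.0]]
OFF = [[0.5, 0.0], [0.0, 0.0]]
diag = [D, D, D]
lower = [OFF, OFF, OFF]
upper = [OFF, OFF, OFF]
# Right-hand side chosen so that the exact solution is all ones.
rhs = [[5.5, 5.0], [6.0, 5.0], [5.5, 5.0]]
x, iters = block_jacobi(diag, lower, upper, rhs)
```

Because the off-diagonal blocks here are small relative to the diagonal blocks, the iteration contracts quickly. With stronger off-diagonal coupling, a plain block-Jacobi sweep can converge slowly or not at all.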

Acknowledgments

The authors thank the anonymous reviewers for their comments on the manuscript. The research was partially funded by the Key Program of National Natural Science Foundation of China (Grant Nos. 61133005 and 61432005), the National Natural Science Foundation of China (Grant Nos. 61370095, 61472124, and 61572175), and the Science and Technology Project of Hunan Province (Grant No. 2015SK20062).

Author information

Corresponding author

Correspondence to Wangdong Yang.

About this article

Cite this article

Yang, W., Li, K. & Li, K. A parallel solving method for block-tridiagonal equations on CPU–GPU heterogeneous computing systems. J Supercomput 73, 1760–1781 (2017). https://doi.org/10.1007/s11227-016-1881-x

  • DOI: https://doi.org/10.1007/s11227-016-1881-x
