
Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster

The Journal of Supercomputing

Abstract

In this study, a communication-avoiding generalized minimum residual method (CA-GMRES) is implemented on a hybrid CPU–GPU cluster, targeting performance acceleration of the iterative linear system solver in the gyrokinetic toroidal five-dimensional Eulerian code (GT5D). In GT5D, the sparse matrix–vector multiplication (SpMV) is performed as a 17-point stencil-based computation. Only the SpMV is specialized for GT5D; the other parts are applicable to other application codes as well. In addition to CA-GMRES, we implement and evaluate a modified variant of CA-GMRES (M-CA-GMRES), proposed in the previous study by Idomura et al. (in: Proceedings of the 8th workshop on latest advances in scalable algorithms for large-scale systems (ScalA ’17), 2017. https://doi.org/10.1145/3148226.3148234), which reduces the amount of floating-point calculations. This study demonstrates that the beneficial features of CA-GMRES are its minimal number of collective communications and its highly efficient computations based on dense matrix–matrix operations. The performance evaluation is conducted on the Reedbush-L GPU cluster, which contains four NVIDIA Tesla P100 (Pascal GP100) GPUs per compute node. The evaluation results show that M-CA-GMRES or CA-GMRES for GT5D is advantageous over GMRES or the generalized conjugate residual method (GCR) on GPU clusters, especially when the problem size (vector length) is large enough that the cost of the SpMV is less dominant. M-CA-GMRES is 1.09×, 1.22× and 1.50× faster than CA-GMRES, GCR and GMRES, respectively, when 64 GPUs are used.
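The communication-avoiding property described above rests on orthogonalizing s basis vectors at once with a tall-skinny QR factorization such as CholQR, which needs only a single global reduction per block of vectors. The following is a minimal serial sketch of this idea, with process-local blocks held as list elements and the allreduce simulated by a sum over the local Gram matrices (a NumPy stand-in, not the authors' GPU implementation):

```python
import numpy as np

def cholqr(blocks):
    """CholQR of a tall-skinny matrix distributed as row blocks.

    Each "process" holds one block V_p of shape (n_local, c).  The only
    global communication needed is one allreduce, simulated here by
    summing the local Gram matrices G_p = V_p^T V_p.
    """
    gram = sum(v.T @ v for v in blocks)       # local products + "allreduce"
    r = np.linalg.cholesky(gram).T            # G = R^T R, R upper triangular
    q_blocks = [v @ np.linalg.inv(r) for v in blocks]  # local, no communication
    return q_blocks, r

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((100, 4)) for _ in range(4)]  # 4 "processes"
q_blocks, r = cholqr(blocks)
q = np.vstack(q_blocks)
assert np.allclose(q.T @ q, np.eye(4), atol=1e-8)        # Q is orthonormal
assert np.allclose(q @ r, np.vstack(blocks), atol=1e-8)  # V = Q R
```

By contrast, classical Gram–Schmidt orthogonalization performs one global reduction per vector, which is why CA-GMRES reduces the collective-communication count.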


Notes

  1. https://www.pgroup.com/.

  2. https://www.open-mpi.org/.

  3. Actually, this algorithm is a truncated version of the GCR known as the \(\hbox {ORTHOMIN}(k=1)\) method [9], not the standard GCR [6, 22].

  4. https://developer.nvidia.com/cublas.

5. The average time is calculated as the measured time divided by the step size s.

6. The amount of internode communication per node is \(8n_{y}n_{z}n_{v}\cdot 2\cdot 8\) bytes and \((8n_{y}\,+ 1n_x)n_{z}n_{v}\cdot 2\cdot 8\) bytes in the cases with \(p=16\) and \(p=64\), respectively.
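These byte counts can be reproduced directly; the grid sizes below are hypothetical placeholders (the paper's actual problem sizes are not given in this footnote), and the factors 2 and 8 are read here as the two exchange directions and 8 bytes per double-precision value:

```python
# Hypothetical grid sizes, for illustration only.
ny, nz, nv, nx = 64, 32, 48, 128

# Internode communication per node, following footnote 6.
bytes_p16 = 8 * ny * nz * nv * 2 * 8             # p = 16 decomposition
bytes_p64 = (8 * ny + 1 * nx) * nz * nv * 2 * 8  # p = 64 decomposition
```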

7. \(f_{i3,j,k+2,l}\), \(f_{i4,j,k+2,l}\), \(f_{i1,j,k,l}\), \(f_{i2,j,k,l}\), \(f_{i5,j,k,l}\), and \(f_{i6,j,k,l}\) in the loop of Algorithm 6.

8. The Cholesky factorization is redundantly computed on all MPI processes, although each computation is identical. We additionally evaluated a CholQR implementation that first gathers all the local products to a single process, performs the Cholesky factorization on that process, and broadcasts the Cholesky factor \(\varvec{R}\) to all processes; the implementation with gather and broadcast is slightly slower than the one with allreduce, owing to the latency increase associated with calling collective communications twice.
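The two variants compared in this footnote can be sketched as follows, with the MPI collectives replaced by serial NumPy stand-ins (list elements play the role of process-local blocks); both variants produce the identical factor R, and the difference is only in how many collectives are issued:

```python
import numpy as np

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((50, 3)) for _ in range(4)]  # per-process V_p

# Variant 1: allreduce -- every process obtains the summed Gram matrix
# and redundantly computes the identical Cholesky factor R (one collective).
gram = sum(v.T @ v for v in blocks)           # stand-in for MPI_Allreduce
r_allreduce = np.linalg.cholesky(gram).T

# Variant 2: gather + broadcast -- a root process collects the local
# products, factorizes, and sends R back (two collectives).
gathered = [v.T @ v for v in blocks]          # stand-in for MPI_Gather
r_root = np.linalg.cholesky(sum(gathered)).T  # computed on the root only
r_broadcast = r_root                          # stand-in for MPI_Bcast

assert np.allclose(r_allreduce, r_broadcast)  # numerically identical results
```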

9. The batch count used is 1024, i.e., each sub-calculation performs a multiplication involving the transposed \((n/1024)\text{-by-}c\) sub-matrix and a \(c\text{-by-}c\) matrix. We have tested other batch counts (512 and 2048); the performance differences among them are small.
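Under one plausible reading of this footnote, the \(c\text{-by-}c\) result is the Gram matrix, assembled by summing batched products of transposed sub-matrices with their untransposed counterparts. The sketch below uses NumPy's batched matmul as a stand-in for a GPU batched GEMM routine (e.g., cuBLAS gemmBatched; this mapping is an assumption, not taken from the paper):

```python
import numpy as np

def batched_gram(v, batch_count):
    """Compute V^T V as a sum of batch_count smaller GEMMs.

    Splitting the tall (n x c) matrix into (n/batch_count x c) chunks
    turns one skinny GEMM into many better-shaped ones, which batched
    GEMM routines can execute efficiently on a GPU.
    """
    n, c = v.shape
    chunks = v.reshape(batch_count, n // batch_count, c)
    # np.matmul broadcasts over the leading batch dimension:
    # (batch, c, n/batch) @ (batch, n/batch, c) -> (batch, c, c)
    partial = np.matmul(chunks.transpose(0, 2, 1), chunks)
    return partial.sum(axis=0)

rng = np.random.default_rng(2)
v = rng.standard_normal((4096, 8))
g = batched_gram(v, 1024)
assert np.allclose(g, v.T @ v)  # matches the unbatched computation
```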

10. The implicit solver of the GT5D is well-conditioned in terms of the convergence property. For ill-conditioned solvers, a more stable basis conversion (such as the Newton basis [12]) or a more stable TSQR algorithm (such as SVQR [25] or CAQR [8]) would probably be required.

11. If a larger number of GPUs (i.e., \(p>64\)) were utilized, the allreduce cost would be more dominant in the total solution time, and the speedup ratio of the M-CA-GMRES over the GMRES or GCR would possibly be higher.

References

  1. Abdelfattah A, Haidar A, Tomov S, Dongarra J (2016) Performance, design, and autotuning of batched GEMM for GPUs. In: Proceedings of the ISC High Performance Computing 2016, LNCS, vol 9697, pp 21–38. Springer

  2. Asahi Y, Latu G, Ina T, Idomura Y, Grandgirard V, Garbet X (2017) Optimization of fusion kernels on accelerators with indirect or strided memory access patterns. IEEE Trans Parallel Distrib Syst 28(7):1974–1988. https://doi.org/10.1109/TPDS.2016.2633349


  3. Bai Z, Hu D, Reichel L (1994) A Newton basis GMRES implementation. IMA J Numer Anal 14(4):563–581. https://doi.org/10.1093/imanum/14.4.563


  4. Carson E (2015) Communication-avoiding Krylov subspace methods in theory and practice. PhD dissertation, University of California, Berkeley

  5. Chronopoulos AT, Gear CW (1989) s-Step iterative methods for symmetric linear systems. J Comput Appl Math 25(2):153–168. https://doi.org/10.1016/0377-0427(89)90045-9


  6. Concus P, Golub GH (1976) A generalized conjugate gradient method for nonsymmetric systems of linear equations. In: Computing Methods in Applied Sciences and Engineering, Lecture Notes in Economics and Mathematical Systems, vol 134. Springer, pp 56–65. https://doi.org/10.1007/978-3-642-85972-4_4


7. Cumming B. STREAM benchmark in CUDA C++. https://github.com/bcumming/cuda-stream. Accessed 5 Nov 2018

  8. Demmel J, Grigori L, Hoemmen M, Langou J (2012) Communication-optimal parallel and sequential QR and LU factorizations. SIAM J Sci Comput 34(1):A206–A239. https://doi.org/10.1137/080731992


  9. Eisenstat SC, Elman HC, Schultz MH (1983) Variational iterative methods for nonsymmetric systems of linear equations. SIAM J Numer Anal 20(2):345–357. https://doi.org/10.1137/0720023


  10. Fujita N, Nuga H, Boku T, Idomura Y (2013) Nuclear fusion simulation code optimization on GPU clusters. In: Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013). IEEE, pp 1266–1274. https://doi.org/10.1109/ICPADS.2013.65

11. Golub GH, Van Loan CF (2013) Matrix computations, 4th edn. The Johns Hopkins University Press, Baltimore


  12. Hoemmen M (2010) Communication-avoiding Krylov subspace methods. PhD dissertation, University of California, Berkeley

  13. Idomura Y, Ida M, Kano T, Aiba N, Tokuda S (2008) Conservative global gyrokinetic toroidal full-f five-dimensional Vlasov simulation. Comput Phys Commun 179(6):391–403. https://doi.org/10.1016/j.cpc.2008.04.005


  14. Idomura Y, Ina T, Mayumi A, Yamada S, Matsumoto K, Asahi Y, Imamura T (2017) Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA ’17), p 7. https://doi.org/10.1145/3148226.3148234

  15. Idomura Y, Nakata M, Yamada S, Machida M, Imamura T, Watanabe T, Nunami M, Inoue H, Tsutsumi S, Miyoshi I, Shida N (2014) Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer. Int J High Perform Comput Appl 28(1):73–86. https://doi.org/10.1177/1094342013490973


  16. Joubert WD, Carey GF (1992) Parallelizable restarted iterative methods for nonsymmetric linear systems. Part I: theory. Int J Comput Math 44(1–4):269–290. https://doi.org/10.1080/00207169208804107


17. McCalpin JD. STREAM: Sustainable memory bandwidth in high performance computers. http://www.cs.virginia.edu/stream/. Accessed 5 Nov 2018

  18. Mohiyuddin M, Hoemmen M, Demmel J, Yelick K (2009) Minimizing communication in sparse matrix solvers. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09). ACM. https://doi.org/10.1145/1654059.1654096

  19. Nath R, Tomov S, Dongarra J (2010) An improved MAGMA GEMM for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511–515. https://doi.org/10.1177/1094342010385729


  20. NVIDIA Corporation: NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect. Accessed 5 Nov 2018

21. Van Rosendale J (1983) Minimizing inner product data dependencies in conjugate gradient iteration. Technical Report NASA-CR-17, NASA

  22. Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia


  23. Saad Y, Schultz MH (1986) GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J Sci Stat Comput 7(3):856–869. https://doi.org/10.1137/0907058


  24. Shimokawabe T, Aoki T, Muroi C, Ishida J, Kawano K, Endo T, Nukada A, Maruyama N, Matsuoka S (2010) An 80-fold speedup, 15.0 TFlops GPU acceleration of non-hydrostatic weather model ASUCA production code. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010). IEEE. https://doi.org/10.1109/SC.2010.9

  25. Stathopoulos A, Wu K (2002) A block orthogonalization procedure with constant synchronization requirements. SIAM J Sci Comput 23(6):2165–2184. https://doi.org/10.1137/S1064827500370883


  26. de Sturler E, van der Vorst HA (1995) Reducing the effect of global communication in GMRES(m) and CG on parallel distributed memory computers. Appl Numer Math 18(4):441–459. https://doi.org/10.1016/0168-9274(95)00079-A


  27. Walker HF (1988) Implementation of the GMRES method using householder transformations. SIAM J Sci Stat Comput 9(1):152–163. https://doi.org/10.1137/0909010


  28. Williams SW (2011) The roofline model. In: Bailey DH, Lucas RF, Williams SW (eds) Performance tuning of scientific applications, chapter 9. CRC Press, Boca Raton, pp 195–215


29. Yamazaki I, Anzt H, Tomov S, Hoemmen M, Dongarra J (2014) Improving the performance of CA-GMRES on multicores with multiple GPUs. In: Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS 2014). IEEE, pp 382–391. https://doi.org/10.1109/IPDPS.2014.48

  30. Yamazaki I, Hoemmen M, Luszczek P, Dongarra J (2017) Improving performance of GMRES by reducing communication and pipelining global collectives. In: Proceedings of the 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2017). IEEE, pp 1118–1127. https://doi.org/10.1109/IPDPSW.2017.65

  31. Yamazaki I, Tomov S, Dongarra J (2015) Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs. SIAM J Sci Comput 37(3):C307–C330. https://doi.org/10.1137/14M0973773


  32. Yamazaki I, Tomov S, Dongarra JJ (2016) Stability and performance of various singular value QR implementations on multicore CPU with a GPU. ACM Trans Math Softw 43(2):1–18. https://doi.org/10.1145/2898347



Acknowledgements

This research was conducted using the SGI Rackable C1102-GP8 (Reedbush-L) at the Information Technology Center, the University of Tokyo. The use of HA-PACS/TCA in the software development for this research was offered under the “Interdisciplinary Computational Science Program” at the Center for Computational Sciences, University of Tsukuba. A part of the algorithm development was conducted on the ICE-X at the JAEA. This work is partly supported by MEXT (Grant for Post-K priority issue No. 6: Development of Innovative Clean Energy). The authors thank Dr. Yuuichi Asahi at the QST for his helpful advice on the SpMV kernel.

Author information

Correspondence to Kazuya Matsumoto.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Matsumoto, K., Idomura, Y., Ina, T. et al. Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster. J Supercomput 75, 8115–8146 (2019). https://doi.org/10.1007/s11227-019-02983-7
