
Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster

The Journal of Supercomputing

Abstract

In this study, a communication-avoiding generalized minimum residual method (CA-GMRES) is implemented on a hybrid CPU–GPU cluster, targeting performance acceleration of the iterative linear system solver in the gyrokinetic toroidal five-dimensional Eulerian code (GT5D). In GT5D, the sparse matrix–vector multiplication (SpMV) is performed as a 17-point stencil-based computation. Only the SpMV is specialized for GT5D; the other parts are applicable to other application codes as well. In addition to CA-GMRES, we implement and evaluate a modified variant of CA-GMRES (M-CA-GMRES), proposed in the previous study by Idomura et al. (in: Proceedings of the 8th workshop on latest advances in scalable algorithms for large-scale systems (ScalA ’17), 2017. https://doi.org/10.1145/3148226.3148234), which reduces the amount of floating-point calculations. This study demonstrates that the beneficial features of CA-GMRES are its minimal number of collective communications and its highly efficient computations based on dense matrix–matrix operations. The performance evaluation is conducted on the Reedbush-L GPU cluster, which contains four NVIDIA Tesla P100 (Pascal GP100) GPUs per compute node. The evaluation results show that M-CA-GMRES or CA-GMRES for GT5D is advantageous over GMRES or the generalized conjugate residual method (GCR) on GPU clusters, especially when the problem size (vector length) is large enough that the cost of the SpMV is less dominant. M-CA-GMRES is 1.09×, 1.22× and 1.50× faster than CA-GMRES, GCR and GMRES, respectively, when 64 GPUs are used.
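The communication-avoiding property described above rests on orthogonalizing s basis vectors at once with a tall-skinny QR factorization such as CholQR, which needs only a single global reduction per block of vectors. The following is a minimal serial sketch of this idea, with process-local blocks held as list elements and the allreduce simulated by a sum over the local Gram matrices (a NumPy stand-in, not the authors' GPU implementation):

```python
import numpy as np

def cholqr(blocks):
    """CholQR of a tall-skinny matrix distributed as row blocks.

    Each "process" holds one block V_p of shape (n_local, c).  The only
    global communication needed is one allreduce, simulated here by
    summing the local Gram matrices G_p = V_p^T V_p.
    """
    gram = sum(v.T @ v for v in blocks)       # local products + "allreduce"
    r = np.linalg.cholesky(gram).T            # G = R^T R, R upper triangular
    q_blocks = [v @ np.linalg.inv(r) for v in blocks]  # local, no communication
    return q_blocks, r

rng = np.random.default_rng(0)
blocks = [rng.standard_normal((100, 4)) for _ in range(4)]  # 4 "processes"
q_blocks, r = cholqr(blocks)
q = np.vstack(q_blocks)
assert np.allclose(q.T @ q, np.eye(4), atol=1e-8)        # Q is orthonormal
assert np.allclose(q @ r, np.vstack(blocks), atol=1e-8)  # V = Q R
```

By contrast, classical Gram–Schmidt orthogonalization performs one global reduction per vector, which is why CA-GMRES reduces the collective-communication count.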


Notes

  1. https://www.pgroup.com/.

  2. https://www.open-mpi.org/.

  3. Actually, this algorithm is a truncated version of the GCR known as the \(\hbox {ORTHOMIN}(k=1)\) method [9], not the standard GCR [6, 22].

  4. https://developer.nvidia.com/cublas.

5. The average time is calculated as the measured time divided by the step size s.

6. The amount of internode communication per node is \(8n_{y}n_{z}n_{v}\cdot 2\cdot 8\) bytes and \((8n_{y}\,+ 1n_x)n_{z}n_{v}\cdot 2\cdot 8\) bytes in the cases with \(p=16\) and \(p=64\), respectively.
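These byte counts can be reproduced directly; the grid sizes below are hypothetical placeholders (the paper's actual problem sizes are not given in this footnote), and the factors 2 and 8 are read here as the two exchange directions and 8 bytes per double-precision value:

```python
# Hypothetical grid sizes, for illustration only.
ny, nz, nv, nx = 64, 32, 48, 128

# Internode communication per node, following footnote 6.
bytes_p16 = 8 * ny * nz * nv * 2 * 8             # p = 16 decomposition
bytes_p64 = (8 * ny + 1 * nx) * nz * nv * 2 * 8  # p = 64 decomposition
```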

7. \(f_{i3,j,k+2,l}\), \(f_{i4,j,k+2,l}\), \(f_{i1,j,k,l}\), \(f_{i2,j,k,l}\), \(f_{i5,j,k,l}\), and \(f_{i6,j,k,l}\) in the loop of Algorithm 6.

8. The Cholesky factorization is redundantly computed on all MPI processes, although each computation is identical. We additionally evaluated a CholQR implementation that first gathers all the local products to a single process, performs the Cholesky factorization on that process, and broadcasts the Cholesky factor \(\varvec{R}\) to all processes; the implementation with gather and broadcast is slightly slower than the one with allreduce, owing to the latency increase associated with calling collective communications twice.
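The two variants compared in this footnote can be sketched as follows, with the MPI collectives replaced by serial NumPy stand-ins (list elements play the role of process-local blocks); both variants produce the identical factor R, and the difference is only in how many collectives are issued:

```python
import numpy as np

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((50, 3)) for _ in range(4)]  # per-process V_p

# Variant 1: allreduce -- every process obtains the summed Gram matrix
# and redundantly computes the identical Cholesky factor R (one collective).
gram = sum(v.T @ v for v in blocks)           # stand-in for MPI_Allreduce
r_allreduce = np.linalg.cholesky(gram).T

# Variant 2: gather + broadcast -- a root process collects the local
# products, factorizes, and sends R back (two collectives).
gathered = [v.T @ v for v in blocks]          # stand-in for MPI_Gather
r_root = np.linalg.cholesky(sum(gathered)).T  # computed on the root only
r_broadcast = r_root                          # stand-in for MPI_Bcast

assert np.allclose(r_allreduce, r_broadcast)  # numerically identical results
```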

9. The batch count used is 1024, i.e., each sub-calculation performs a multiplication involving the transposed \((n/1024)\text{-by-}c\) sub-matrix and a \(c\text{-by-}c\) matrix. We have tested other batch counts (512 and 2048); the performance differences among them are small.
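Under one plausible reading of this footnote, the \(c\text{-by-}c\) result is the Gram matrix, assembled by summing batched products of transposed sub-matrices with their untransposed counterparts. The sketch below uses NumPy's batched matmul as a stand-in for a GPU batched GEMM routine (e.g., cuBLAS gemmBatched; this mapping is an assumption, not taken from the paper):

```python
import numpy as np

def batched_gram(v, batch_count):
    """Compute V^T V as a sum of batch_count smaller GEMMs.

    Splitting the tall (n x c) matrix into (n/batch_count x c) chunks
    turns one skinny GEMM into many better-shaped ones, which batched
    GEMM routines can execute efficiently on a GPU.
    """
    n, c = v.shape
    chunks = v.reshape(batch_count, n // batch_count, c)
    # np.matmul broadcasts over the leading batch dimension:
    # (batch, c, n/batch) @ (batch, n/batch, c) -> (batch, c, c)
    partial = np.matmul(chunks.transpose(0, 2, 1), chunks)
    return partial.sum(axis=0)

rng = np.random.default_rng(2)
v = rng.standard_normal((4096, 8))
g = batched_gram(v, 1024)
assert np.allclose(g, v.T @ v)  # matches the unbatched computation
```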

10. The implicit solver of the GT5D is well-conditioned in terms of the convergence property. For ill-conditioned solvers, a more stable basis conversion (such as the Newton basis [12]) or a more stable TSQR algorithm (such as SVQR [25] or CAQR [8]) would probably be required.

11. If a larger number of GPUs (i.e., \(p>64\)) were utilized, the allreduce cost would be more dominant in the total solution time, and the speedup ratio of the M-CA-GMRES over the GMRES or GCR would possibly be higher.

References

  1. Abdelfattah A, Haidar A, Tomov S, Dongarra J (2016) Performance, design, and autotuning of batched GEMM for GPUs. In: Proceedings of the ISC High Performance Computing 2016, LNCS, vol 9697, pp 21–38. Springer

  2. Asahi Y, Latu G, Ina T, Idomura Y, Grandgirard V, Garbet X (2017) Optimization of fusion kernels on accelerators with indirect or strided memory access patterns. IEEE Trans Parallel Distrib Syst 28(7):1974–1988. https://doi.org/10.1109/TPDS.2016.2633349


  3. Bai Z, Hu D, Reichel L (1994) A Newton basis GMRES implementation. IMA J Numer Anal 14(4):563–581. https://doi.org/10.1093/imanum/14.4.563


  4. Carson E (2015) Communication-avoiding Krylov subspace methods in theory and practice. PhD dissertation, University of California, Berkeley

  5. Chronopoulos AT, Gear CW (1989) s-Step iterative methods for symmetric linear systems. J Comput Appl Math 25(2):153–168. https://doi.org/10.1016/0377-0427(89)90045-9


  6. Concus P, Golub GH (1976) A generalized conjugate gradient method for nonsymmetric systems of linear equations. In: Computing Methods in Applied Sciences and Engineering, Lecture Notes in Economics and Mathematical Systems, vol 134. Springer, pp 56–65. https://doi.org/10.1007/978-3-642-85972-4_4


7. Cumming B. STREAM benchmark in CUDA C++. https://github.com/bcumming/cuda-stream. Accessed 5 Nov 2018

  8. Demmel J, Grigori L, Hoemmen M, Langou J (2012) Communication-optimal parallel and sequential QR and LU factorizations. SIAM J Sci Comput 34(1):A206–A239. https://doi.org/10.1137/080731992


  9. Eisenstat SC, Elman HC, Schultz MH (1983) Variational iterative methods for nonsymmetric systems of linear equations. SIAM J Numer Anal 20(2):345–357. https://doi.org/10.1137/0720023


  10. Fujita N, Nuga H, Boku T, Idomura Y (2013) Nuclear fusion simulation code optimization on GPU clusters. In: Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013). IEEE, pp 1266–1274. https://doi.org/10.1109/ICPADS.2013.65

11. Golub GH, Van Loan CF (2013) Matrix computations, 4th edn. The Johns Hopkins University Press, Baltimore


  12. Hoemmen M (2010) Communication-avoiding Krylov subspace methods. PhD dissertation, University of California, Berkeley

  13. Idomura Y, Ida M, Kano T, Aiba N, Tokuda S (2008) Conservative global gyrokinetic toroidal full-f five-dimensional Vlasov simulation. Comput Phys Commun 179(6):391–403. https://doi.org/10.1016/j.cpc.2008.04.005


  14. Idomura Y, Ina T, Mayumi A, Yamada S, Matsumoto K, Asahi Y, Imamura T (2017) Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA ’17), p 7. https://doi.org/10.1145/3148226.3148234

  15. Idomura Y, Nakata M, Yamada S, Machida M, Imamura T, Watanabe T, Nunami M, Inoue H, Tsutsumi S, Miyoshi I, Shida N (2014) Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer. Int J High Perform Comput Appl 28(1):73–86. https://doi.org/10.1177/1094342013490973


  16. Joubert WD, Carey GF (1992) Parallelizable restarted iterative methods for nonsymmetric linear systems. Part I: theory. Int J Comput Math 44(1–4):269–290. https://doi.org/10.1080/00207169208804107


17. McCalpin JD. STREAM: Sustainable memory bandwidth in high performance computers. http://www.cs.virginia.edu/stream/. Accessed 5 Nov 2018

  18. Mohiyuddin M, Hoemmen M, Demmel J, Yelick K (2009) Minimizing communication in sparse matrix solvers. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09). ACM. https://doi.org/10.1145/1654059.1654096

  19. Nath R, Tomov S, Dongarra J (2010) An improved MAGMA GEMM for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511–515. https://doi.org/10.1177/1094342010385729


  20. NVIDIA Corporation: NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect. Accessed 5 Nov 2018

21. Van Rosendale J (1983) Minimizing inner product data dependencies in conjugate gradient iteration. Technical Report NASA-CR-17, NASA

  22. Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia


  23. Saad Y, Schultz MH (1986) GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J Sci Stat Comput 7(3):856–869. https://doi.org/10.1137/0907058


  24. Shimokawabe T, Aoki T, Muroi C, Ishida J, Kawano K, Endo T, Nukada A, Maruyama N, Matsuoka S (2010) An 80-fold speedup, 15.0 TFlops GPU acceleration of non-hydrostatic weather model ASUCA production code. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010). IEEE. https://doi.org/10.1109/SC.2010.9

  25. Stathopoulos A, Wu K (2002) A block orthogonalization procedure with constant synchronization requirements. SIAM J Sci Comput 23(6):2165–2184. https://doi.org/10.1137/S1064827500370883


  26. de Sturler E, van der Vorst HA (1995) Reducing the effect of global communication in GMRES(m) and CG on parallel distributed memory computers. Appl Numer Math 18(4):441–459. https://doi.org/10.1016/0168-9274(95)00079-A


  27. Walker HF (1988) Implementation of the GMRES method using householder transformations. SIAM J Sci Stat Comput 9(1):152–163. https://doi.org/10.1137/0909010


  28. Williams SW (2011) The roofline model. In: Bailey DH, Lucas RF, Williams SW (eds) Performance tuning of scientific applications, chapter 9. CRC Press, Boca Raton, pp 195–215


29. Yamazaki I, Anzt H, Tomov S, Hoemmen M, Dongarra J (2014) Improving the performance of CA-GMRES on multicores with multiple GPUs. In: Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS 2014). IEEE, pp 382–391. https://doi.org/10.1109/IPDPS.2014.48

  30. Yamazaki I, Hoemmen M, Luszczek P, Dongarra J (2017) Improving performance of GMRES by reducing communication and pipelining global collectives. In: Proceedings of the 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2017). IEEE, pp 1118–1127. https://doi.org/10.1109/IPDPSW.2017.65

  31. Yamazaki I, Tomov S, Dongarra J (2015) Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs. SIAM J Sci Comput 37(3):C307–C330. https://doi.org/10.1137/14M0973773


  32. Yamazaki I, Tomov S, Dongarra JJ (2016) Stability and performance of various singular value QR implementations on multicore CPU with a GPU. ACM Trans Math Softw 43(2):1–18. https://doi.org/10.1145/2898347



Acknowledgements

This research was conducted using the SGI Rackable C1102-GP8 (Reedbush-L) at the Information Technology Center, the University of Tokyo. The use of HA-PACS/TCA in the software development for this research was offered under the “Interdisciplinary Computational Science Program” at the Center for Computational Sciences, University of Tsukuba. A part of the algorithm development was conducted on the ICE-X at the JAEA. This work is partly supported by MEXT (Grant for Post-K priority issue No. 6: Development of Innovative Clean Energy). The authors thank Dr. Yuuichi Asahi at the QST for his helpful advice on the SpMV kernel.

Author information

Correspondence to Kazuya Matsumoto.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Matsumoto, K., Idomura, Y., Ina, T. et al. Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster. J Supercomput 75, 8115–8146 (2019). https://doi.org/10.1007/s11227-019-02983-7
