Abstract
In this study, a communication-avoiding generalized minimum residual method (CA-GMRES) is implemented on a hybrid CPU–GPU cluster to accelerate the iterative linear system solver in the gyrokinetic toroidal five-dimensional Eulerian code (GT5D). In the GT5D, the sparse matrix–vector multiplication (SpMV) is performed as a 17-point stencil-based computation. Only the SpMV kernel is specialized for the GT5D; the other parts are applicable to other application codes as well. In addition to the CA-GMRES, we implement and evaluate a modified variant of CA-GMRES (M-CA-GMRES) proposed in a previous study Idomura et al. (in: Proceedings of the 8th workshop on latest advances in scalable algorithms for large-scale systems (ScalA '17), 2017. https://doi.org/10.1145/3148226.3148234) to reduce the amount of floating-point calculations. This study demonstrates that the beneficial features of the CA-GMRES are its minimum number of collective communications and its highly efficient calculations based on dense matrix–matrix operations. The performance evaluation is conducted on the Reedbush-L GPU cluster, which contains four NVIDIA Tesla P100 (Pascal GP100) GPUs per compute node. The evaluation results show that the M-CA-GMRES and CA-GMRES for the GT5D are advantageous over the GMRES and the generalized conjugate residual method (GCR) on GPU clusters, especially when the problem size (vector length) is large so that the cost of the SpMV is less dominant. The M-CA-GMRES is 1.09×, 1.22×, and 1.50× faster than the CA-GMRES, GCR, and GMRES, respectively, when 64 GPUs are used.
Notes
The average time is calculated as the measured time divided by the step size s.
The amount of internode communication per node is \(8n_{y}n_{z}n_{v}\cdot 2\cdot 8\) bytes and \((8n_{y}\,+ 1n_x)n_{z}n_{v}\cdot 2\cdot 8\) bytes in the cases with \(p=16\) and \(p=64\), respectively.
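As a sanity check, the two byte counts above can be evaluated directly. The grid sizes \(n_x, n_y, n_z, n_v\) below are hypothetical values chosen only for illustration, not the paper's actual problem sizes:

```python
# Illustrative evaluation of the per-node internode communication volumes
# stated in the note; the grid sizes below are hypothetical.
ny, nz, nv, nx = 32, 32, 64, 128

# 2 exchanges of 8-byte (double-precision) elements per halo point.
bytes_p16 = 8 * ny * nz * nv * 2 * 8             # p = 16 case
bytes_p64 = (8 * ny + 1 * nx) * nz * nv * 2 * 8  # p = 64 case
```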
f\(_{\hbox {i3,j,k+2,l}}\), f\(_{\hbox {i4,j,k+2,l}}\), f\(_{\hbox {i1,j,k,l}}\), f\(_{\hbox {i2,j,k,l}}\), f\(_{\hbox {i5,j,k,l}}\), and f\(_{\hbox {i6,j,k,l}}\) in the loop of Algorithm 6.
The Cholesky factorization is redundantly computed on all MPI processes, although each computation is identical. We have additionally evaluated a CholQR implementation that first gathers all the local products to a single process, conducts the Cholesky factorization on that process, and broadcasts the Cholesky factor \(\varvec{R}\) to all processes; the implementation with gather and broadcast is slightly slower than that with allreduce because of the latency increase associated with calling collective communications twice.
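The CholQR step discussed above can be sketched in NumPy on a single process. This is only a shared-memory illustration under stated assumptions: in the distributed implementation, the Gram matrix would be the allreduce sum of local products \(\varvec{V}^{T}\varvec{V}\), and the Cholesky factorization of it is computed redundantly on every process.

```python
import numpy as np

def cholqr(V):
    """Orthonormalize the columns of V via Cholesky QR.

    Single-process sketch: in the MPI version, G is the allreduce sum of
    the local products V_local.T @ V_local, and the small Cholesky
    factorization is repeated identically on all processes.
    """
    G = V.T @ V                      # c-by-c Gram matrix (one allreduce in MPI)
    L = np.linalg.cholesky(G)        # G = L @ L.T
    R = L.T                          # upper-triangular factor, G = R.T @ R
    Q = np.linalg.solve(R.T, V.T).T  # Q = V @ inv(R), via a triangular solve
    return Q, R

rng = np.random.default_rng(0)
V = rng.standard_normal((1000, 8))
Q, R = cholqr(V)
```

Because only one reduction (forming `G`) involves all rows of `V`, the distributed variant needs a single allreduce per orthogonalization, which is the latency advantage over gather-plus-broadcast noted above.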
The batch count used is 1024, i.e., each sub-calculation multiplies a transposed \((n/1024)\text{-by-}c\) sub-matrix by a \(c\text{-by-}c\) matrix. We have tested other batch counts (512 and 2048); the performance differences among them are small.
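The batched splitting described above can be illustrated as follows. Under the (assumed) reading that each sub-calculation multiplies an \((n/\text{batch})\)-by-\(c\) row block by the same \(c\)-by-\(c\) matrix, the batched result must equal a single large GEMM; the sizes below are illustrative, not the paper's.

```python
import numpy as np

n, c, batch = 4096, 16, 8  # illustrative sizes; the paper uses a batch count of 1024
rng = np.random.default_rng(1)
V = rng.standard_normal((n, c))
B = rng.standard_normal((c, c))

ref = V @ B  # one large GEMM, for comparison

# Batched variant: split V row-wise into `batch` sub-matrices of
# (n/batch)-by-c each and multiply every block by B independently,
# as a batched GEMM routine would do in parallel.
rows = n // batch
out = np.empty_like(ref)
for b in range(batch):
    out[b * rows:(b + 1) * rows] = V[b * rows:(b + 1) * rows] @ B
```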
If a larger number of GPUs (i.e., \(p>64\)) is utilized, the allreduce cost becomes more dominant in the total solution time, and the speedup ratio of the M-CA-GMRES over the GMRES or GCR is possibly even higher.
References
Abdelfattah A, Haidar A, Tomov S, Dongarra J (2016) Performance, design, and autotuning of batched GEMM for GPUs. In: Proceedings of the ISC High Performance Computing 2016, LNCS, vol 9697, pp 21–38. Springer
Asahi Y, Latu G, Ina T, Idomura Y, Grandgirard V, Garbet X (2017) Optimization of fusion kernels on accelerators with indirect or strided memory access patterns. IEEE Trans Parallel Distrib Syst 28(7):1974–1988. https://doi.org/10.1109/TPDS.2016.2633349
Bai Z, Hu D, Reichel L (1994) A Newton basis GMRES implementation. IMA J Numer Anal 14(4):563–581. https://doi.org/10.1093/imanum/14.4.563
Carson E (2015) Communication-avoiding Krylov subspace methods in theory and practice. PhD dissertation, University of California, Berkeley
Chronopoulos AT, Gear CW (1989) s-Step iterative methods for symmetric linear systems. J Comput Appl Math 25(2):153–168. https://doi.org/10.1016/0377-0427(89)90045-9
Concus P, Golub GH (1976) A generalized conjugate gradient method for nonsymmetric systems of linear equations. In: Computing Methods in Applied Sciences and Engineering, Lecture Notes in Economics and Mathematical Systems, vol 134. Springer, pp 56–65. https://doi.org/10.1007/978-3-642-85972-4_4
Cumming B (2018) STREAM benchmark in CUDA C++. https://github.com/bcumming/cuda-stream. Accessed 5 Nov 2018
Demmel J, Grigori L, Hoemmen M, Langou J (2012) Communication-optimal parallel and sequential QR and LU factorizations. SIAM J Sci Comput 34(1):A206–A239. https://doi.org/10.1137/080731992
Eisenstat SC, Elman HC, Schultz MH (1983) Variational iterative methods for nonsymmetric systems of linear equations. SIAM J Numer Anal 20(2):345–357. https://doi.org/10.1137/0720023
Fujita N, Nuga H, Boku T, Idomura Y (2013) Nuclear fusion simulation code optimization on GPU clusters. In: Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems (ICPADS 2013). IEEE, pp 1266–1274. https://doi.org/10.1109/ICPADS.2013.65
Golub GH, Van Loan CF (2013) Matrix computations, 4th edn. The Johns Hopkins University Press, Baltimore
Hoemmen M (2010) Communication-avoiding Krylov subspace methods. PhD dissertation, University of California, Berkeley
Idomura Y, Ida M, Kano T, Aiba N, Tokuda S (2008) Conservative global gyrokinetic toroidal full-f five-dimensional Vlasov simulation. Comput Phys Commun 179(6):391–403. https://doi.org/10.1016/j.cpc.2008.04.005
Idomura Y, Ina T, Mayumi A, Yamada S, Matsumoto K, Asahi Y, Imamura T (2017) Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA ’17), p 7. https://doi.org/10.1145/3148226.3148234
Idomura Y, Nakata M, Yamada S, Machida M, Imamura T, Watanabe T, Nunami M, Inoue H, Tsutsumi S, Miyoshi I, Shida N (2014) Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer. Int J High Perform Comput Appl 28(1):73–86. https://doi.org/10.1177/1094342013490973
Joubert WD, Carey GF (1992) Parallelizable restarted iterative methods for nonsymmetric linear systems. Part I: theory. Int J Comput Math 44(1–4):269–290. https://doi.org/10.1080/00207169208804107
McCalpin JD (2018) STREAM: sustainable memory bandwidth in high performance computers. http://www.cs.virginia.edu/stream/. Accessed 5 Nov 2018
Mohiyuddin M, Hoemmen M, Demmel J, Yelick K (2009) Minimizing communication in sparse matrix solvers. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09). ACM. https://doi.org/10.1145/1654059.1654096
Nath R, Tomov S, Dongarra J (2010) An improved MAGMA GEMM for Fermi graphics processing units. Int J High Perform Comput Appl 24(4):511–515. https://doi.org/10.1177/1094342010385729
NVIDIA Corporation: NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect. Accessed 5 Nov 2018
Van Rosendale J (1983) Minimizing inner product data dependencies in conjugate gradient iteration. Technical Report NASA-CR-17, NASA
Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia
Saad Y, Schultz MH (1986) GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J Sci Stat Comput 7(3):856–869. https://doi.org/10.1137/0907058
Shimokawabe T, Aoki T, Muroi C, Ishida J, Kawano K, Endo T, Nukada A, Maruyama N, Matsuoka S (2010) An 80-fold speedup, 15.0 TFlops GPU acceleration of non-hydrostatic weather model ASUCA production code. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010). IEEE. https://doi.org/10.1109/SC.2010.9
Stathopoulos A, Wu K (2002) A block orthogonalization procedure with constant synchronization requirements. SIAM J Sci Comput 23(6):2165–2184. https://doi.org/10.1137/S1064827500370883
de Sturler E, van der Vorst HA (1995) Reducing the effect of global communication in GMRES(m) and CG on parallel distributed memory computers. Appl Numer Math 18(4):441–459. https://doi.org/10.1016/0168-9274(95)00079-A
Walker HF (1988) Implementation of the GMRES method using householder transformations. SIAM J Sci Stat Comput 9(1):152–163. https://doi.org/10.1137/0909010
Williams SW (2011) The roofline model. In: Bailey DH, Lucas RF, Williams SW (eds) Performance tuning of scientific applications, chapter 9. CRC Press, Boca Raton, pp 195–215
Yamazaki I, Anzt H, Tomov S, Hoemmen M, Dongarra J (2014) Improving the performance of CA-GMRES on multicores with multiple GPUs. In: Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS 2014). IEEE, pp 382–391. https://doi.org/10.1109/IPDPS.2014.48
Yamazaki I, Hoemmen M, Luszczek P, Dongarra J (2017) Improving performance of GMRES by reducing communication and pipelining global collectives. In: Proceedings of the 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2017). IEEE, pp 1118–1127. https://doi.org/10.1109/IPDPSW.2017.65
Yamazaki I, Tomov S, Dongarra J (2015) Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs. SIAM J Sci Comput 37(3):C307–C330. https://doi.org/10.1137/14M0973773
Yamazaki I, Tomov S, Dongarra JJ (2016) Stability and performance of various singular value QR implementations on multicore CPU with a GPU. ACM Trans Math Softw 43(2):1–18. https://doi.org/10.1145/2898347
Acknowledgements
This research was conducted using the SGI Rackable C1102-GP8 (Reedbush-L) in the Information Technology Center, the University of Tokyo. The use of HA-PACS/TCA in the software development for this research is offered under the “Interdisciplinary Computational Science Program” in Center for Computational Sciences, University of Tsukuba. A part of the algorithm development was conducted on the ICE-X at the JAEA. This work is partly supported by the MEXT (Grant for Post-K priority issue No.6: Development of Innovative Clean Energy). The authors thank Dr. Yuuichi Asahi at the QST for his helpful advice on the SpMV kernel.
Cite this article
Matsumoto, K., Idomura, Y., Ina, T. et al. Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster. J Supercomput 75, 8115–8146 (2019). https://doi.org/10.1007/s11227-019-02983-7