Abstract
The use of high-order spatial discretizations is an important trend in the development of global atmospheric models. As a competitive choice, the multi-moment constrained volume (MCV) method can achieve high accuracy while maintaining parallel scalability similar to that of classical finite volume methods. In this work, we present the development of a hybrid parallel MCV-based global shallow-water model on the cubed-sphere grid. Starting from a sequential code, we parallelize the model at both the process and the thread levels. To enable process-level parallelism, we first decompose the six patches of the cubed-sphere using the same 2-D partition and then employ a conflict-free pipe-flow communication scheme to overlap the halo exchange with computation. To further exploit the heterogeneous computing capacity of an Intel Xeon Phi accelerated supercomputer, we propose a guided panel-based inner–outer partition to distribute the workload among the CPUs and the coprocessors. In addition, thread-level parallelism, along with various optimizations, is applied on both the multi-core CPU and the many-core accelerator. Numerical experiments are carried out to validate the correctness of the optimized parallel code and to examine its parallel performance. Test results show that both the CPU-only and the hybrid codes scale well to hundreds of processes in both strong and weak scaling. In particular, the hybrid code achieves a speedup of \(2.56\times\) compared to the CPU-only version. In the largest run, on a \(9216 \times 9216 \times 6\) mesh (1.5 billion unknowns), the hybrid code sustains an aggregate performance of 26.5 Tflops with 486 processes (33,534 cores).
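As a rough illustration of the process-level decomposition described above, the following Python sketch splits each of the six cubed-sphere patches with the same 2-D partition and checks the figures quoted for the largest run. It is only a sketch under stated assumptions: the 9 × 9 process layout per patch (which gives 6 × 81 = 486 processes), the three prognostic variables per mesh point, and the names `Block`, `split_1d`, and `partition_cubed_sphere` are all hypothetical and are not taken from the model's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Index range [i0, i1) x [j0, j1) owned by one process on one patch."""
    patch: int
    i0: int
    i1: int
    j0: int
    j1: int

    @property
    def cells(self) -> int:
        return (self.i1 - self.i0) * (self.j1 - self.j0)


def split_1d(n: int, parts: int) -> list[tuple[int, int]]:
    """Split n points into `parts` contiguous ranges whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    bounds = []
    start = 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def partition_cubed_sphere(n: int, px: int, py: int, patches: int = 6) -> list[Block]:
    """Apply the same px-by-py partition to every patch of an n x n x patches mesh."""
    rows = split_1d(n, px)
    cols = split_1d(n, py)
    return [
        Block(patch, i0, i1, j0, j1)
        for patch in range(patches)
        for (i0, i1) in rows
        for (j0, j1) in cols
    ]


# Assumed layout for the largest run: 9 x 9 processes on each of the 6 patches.
blocks = partition_cubed_sphere(9216, 9, 9)
total_cells = sum(b.cells for b in blocks)

# Assumption: three prognostic variables (height and two velocity components).
VARIABLES_PER_POINT = 3
total_unknowns = total_cells * VARIABLES_PER_POINT

print(len(blocks), total_cells, total_unknowns)
```

Under these assumptions every process owns a 1024 × 1024 block, and the total of about 1.53 billion unknowns is consistent with the 1.5 billion quoted for the largest run. Because all six patches share one partition, a process's neighbours within a patch follow the same index pattern everywhere, and only the exchanges across patch edges need special handling.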













Acknowledgements
This work was supported in part by the Natural Science Foundation of China (Grant No. 91530323), the National Key R&D Plan of China (Grant No. 2016YFB0200600), the National Key Technology R&D Program of China (Grant No. 2012BAC22B01), and the Chinese Academy of Sciences (Grant No. QYZDB-SSWSYS006).
Cite this article
Zhang, P., Yang, C., Chen, C. et al. Development of a hybrid parallel MCV-based high-order global shallow-water model. J Supercomput 73, 2823–2842 (2017). https://doi.org/10.1007/s11227-017-1958-1
DOI: https://doi.org/10.1007/s11227-017-1958-1