Abstract
Lower-upper symmetric Gauss–Seidel (LU-SGS) is a typical Gauss–Seidel method, and its inherently strong data dependency poses tough challenges for shared-memory parallelization. On early multi-core processors, the pipelined parallel LU-SGS approach achieves promising scalability. However, on emerging many-core processors such as Xeon Phi, experience with our in-house high-order CFD program shows that the parallel efficiency drops dramatically to less than 25%. In this paper, we model and analyze the performance of the pipelined parallel LU-SGS algorithm, and present a two-level pipeline (TL-Pipeline) approach that uses nested OpenMP to further exploit fine-grained parallelism and mitigate the parallel performance bottlenecks. Our TL-Pipeline approach achieves a 20% performance gain for a regular problem \((256\times 256\times 256)\) on Xeon Phi. We also discuss some practical problems for realistic CFD simulations, including domain decomposition and the tuning of algorithm parameters. More generally, our work is applicable to the shared-memory parallelization of all Gauss–Seidel-like methods with intrinsically strong data dependency.
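To make the data dependency and the pipelining idea concrete, the following is a minimal Python sketch of a toy forward sweep, not the authors' LU-SGS kernel. It assumes that each grid point depends on its i-1, j-1 and k-1 neighbors, that the j direction is split into contiguous chunks (one per simulated thread), and that thread t works on plane k at pipeline step t + k. The update formula and the names `update`, `sequential_sweep`, and `pipelined_sweep` are hypothetical and chosen only for illustration; a real LU-SGS sweep involves flux Jacobians and a matching backward sweep.

```python
def neighbor(x, i, j, k):
    # Boundary values are 0; reading an interior point that has not been
    # computed yet raises KeyError, which would expose a broken dependency.
    if i < 0 or j < 0 or k < 0:
        return 0.0
    return x[i, j, k]


def update(x, i, j, k):
    # Toy Gauss-Seidel-like update (hypothetical formula): the new value
    # depends on already-updated lower neighbors in all three directions.
    x[i, j, k] = 1.0 + 0.25 * (
        neighbor(x, i - 1, j, k)
        + neighbor(x, i, j - 1, k)
        + neighbor(x, i, j, k - 1)
    )


def sequential_sweep(ni, nj, nk):
    # Reference sweep in plain lexicographic order.
    x = {}
    for k in range(nk):
        for j in range(nj):
            for i in range(ni):
                update(x, i, j, k)
    return x


def pipelined_sweep(ni, nj, nk, num_threads):
    # Simulated pipeline: thread t owns a contiguous chunk of j and
    # processes plane k at step s = t + k.
    x = {}
    bounds = [
        (t * nj // num_threads, (t + 1) * nj // num_threads)
        for t in range(num_threads)
    ]
    steps = nk + num_threads - 1  # includes pipeline fill and drain
    for s in range(steps):
        # Threads are visited in reverse order on purpose: within one step
        # their work is independent, so the order must not matter.
        for t in reversed(range(num_threads)):
            k = s - t
            if 0 <= k < nk:
                j_lo, j_hi = bounds[t]
                for j in range(j_lo, j_hi):
                    for i in range(ni):
                        update(x, i, j, k)
    return x, steps


if __name__ == "__main__":
    ref = sequential_sweep(3, 8, 5)
    got, steps = pipelined_sweep(3, 8, 5, num_threads=4)
    print(got == ref, steps)
```

In this toy model the pipelined schedule reproduces the sequential result exactly, but it needs `nk + num_threads - 1` steps, so threads sit idle while the pipeline fills and drains. That idle time grows with the thread count, which is one plausible reason a single-level pipeline loses efficiency on many-core hardware.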











Acknowledgements
This paper was supported by the Basic Research Program of National University of Defense Technology under Grant No. ZDYYJCYJ20140101, the Open Research Program of China State Key Laboratory of Aerodynamics under Grant No. SKLA20160104, the Defense Industrial Technology Development Program under Grant No. C1520110002, and the National Science Foundation of China under Grant Nos. 11502296 and 61561146395.
Cite this article
Li, D., Xu, C., Cheng, B. et al. Performance modeling and optimization of parallel LU-SGS on many-core processors for 3D high-order CFD simulations. J Supercomput 73, 2506–2524 (2017). https://doi.org/10.1007/s11227-016-1943-0
DOI: https://doi.org/10.1007/s11227-016-1943-0