Abstract
We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality optimizations through auto-tuned tiling for efficient use of hierarchical memory; and (iv) register blocking and data parallelism via single-instruction multiple-data (SIMD) techniques to utilize registers and exploit data locality. The scheme is applied to a sixth-order stencil-based finite-difference time-domain code. Weak-scaling parallel efficiency exceeds 98% on 32,768 BlueGene/P processors. Multithreading with explicit NUMA control attains a 9.9-fold speedup on a dual 12-core AMD Opteron system. Data locality optimizations achieve a 7.7-fold reduction in the last-level cache miss rate on Intel Nehalem, whereas register blocking increases data parallelism and thereby achieves 5.9 Gflops on a single core. Combined register blocking and multithreading optimizations achieve a 5.8-fold speedup on a single quad-core Nehalem.
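To make the tiling idea in item (iii) concrete, the following is a minimal sketch of a tiled sweep of a sixth-order finite-difference Laplacian over a 3D grid. It is an illustration under stated assumptions, not the authors' implementation: the function names (`laplacian_tile`, `laplacian_tiled`), the fixed tile shape, and the use of NumPy are all hypothetical choices, and the weights are the standard textbook central-difference coefficients rather than values taken from the paper. The sketch shows only the loop structure; in an interpreted language it will not reproduce the cache behavior reported above, and the auto-tuning of tile sizes is not modeled.

```python
import numpy as np

# Central-difference weights for a second derivative with sixth-order accuracy
# (offsets -3..3). Standard textbook values, assumed here for illustration.
C = np.array([1/90, -3/20, 3/2, -49/18, 3/2, -3/20, 1/90])
R = 3  # stencil radius: three neighbors on each side along every axis


def laplacian_tile(u, out, h, lo, hi):
    """Apply the stencil to the block [lo, hi) of interior grid points."""
    (i0, j0, k0), (i1, j1, k1) = lo, hi
    acc = np.zeros((i1 - i0, j1 - j0, k1 - k0))
    for s in range(-R, R + 1):
        w = C[s + R]
        acc += w * u[i0 + s:i1 + s, j0:j1, k0:k1]  # neighbors along x
        acc += w * u[i0:i1, j0 + s:j1 + s, k0:k1]  # neighbors along y
        acc += w * u[i0:i1, j0:j1, k0 + s:k1 + s]  # neighbors along z
    out[i0:i1, j0:j1, k0:k1] = acc / h**2


def laplacian_tiled(u, h, tile=(8, 8, 8)):
    """Sweep the interior tile by tile; boundary layers of width R stay zero."""
    out = np.zeros_like(u)
    nx, ny, nz = u.shape
    for i0 in range(R, nx - R, tile[0]):
        for j0 in range(R, ny - R, tile[1]):
            for k0 in range(R, nz - R, tile[2]):
                hi = (min(i0 + tile[0], nx - R),
                      min(j0 + tile[1], ny - R),
                      min(k0 + tile[2], nz - R))
                laplacian_tile(u, out, h, (i0, j0, k0), hi)
    return out


# Usage: the stencil is exact for low-degree polynomials, so the Laplacian of
# x^2 + y^2 + z^2 should equal 6 at every interior point, up to rounding.
n, h = 20, 0.1
x = np.arange(n) * h
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
u = X**2 + Y**2 + Z**2
lap = laplacian_tiled(u, h, tile=(5, 7, 4))
```

The tile shape changes only the traversal order, not the result, which is what allows a tuner to search over tile sizes freely. In a compiled implementation each tile would be sized so that its working set, including the halo of width `R`, fits in the targeted cache level.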
Acknowledgements
This work was partially supported by NSF PetaApps/EMT/CMMI, DOE SciDAC/SciDAC-e/BES/INCITE, and DTRA. Performance tests were carried out at the Collaboratory for Advanced Computing and Simulations and High Performance Computing Center of the University of Southern California. This research also used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.
Cite this article
Dursun, H., Kunaseth, M., Nomura, Ki. et al. Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters. J Supercomput 62, 946–966 (2012). https://doi.org/10.1007/s11227-012-0764-z
DOI: https://doi.org/10.1007/s11227-012-0764-z