
Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters

The Journal of Supercomputing

Abstract

We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality optimizations through auto-tuned tiling for efficient use of hierarchical memory; and (iv) register blocking and data parallelism via single-instruction multiple-data (SIMD) techniques to utilize registers and exploit data locality. The scheme is applied to a sixth-order stencil-based finite-difference time-domain code. Weak-scaling parallel efficiency is over 98% on 32,768 BlueGene/P processors. Multithreading with explicit NUMA control attains a 9.9-fold speedup on a dual 12-core AMD Opteron system. Data locality optimizations achieve a 7.7-fold reduction of the last-level cache miss rate on Intel Nehalem, whereas register blocking increases data parallelism and thereby achieves 5.9 Gflops on a single core. Combining register blocking with multithreading achieves a 5.8-fold speedup on a single quad-core Nehalem.
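As a rough illustration of item (iii), the sketch below applies a sixth-order Laplacian stencil to a small 3D grid twice: once with plain triple loops and once with the loops split into tiles so that each block of grid points is finished before moving on. It is not the authors' code. The function names, the tile shape, and the choice of the standard sixth-order central-difference weights for a second derivative (unit grid spacing) are assumptions made for illustration, and pure Python is used only for readability, not performance.

```python
# Sixth-order central-difference weights for a second derivative
# (unit grid spacing). C[0] is the center weight, C[r] the weight
# of the two neighbors at distance r along an axis.
C = (-49.0 / 18.0, 3.0 / 2.0, -3.0 / 20.0, 1.0 / 90.0)
R = 3  # stencil radius: three neighbors on each side per axis


def zeros(n):
    """Return an n x n x n nested list filled with 0.0."""
    return [[[0.0] * n for _ in range(n)] for _ in range(n)]


def laplacian_point(u, i, j, k):
    """Apply the 19-point sixth-order Laplacian stencil at (i, j, k)."""
    acc = 3.0 * C[0] * u[i][j][k]
    for r in range(1, R + 1):
        acc += C[r] * (u[i - r][j][k] + u[i + r][j][k]
                       + u[i][j - r][k] + u[i][j + r][k]
                       + u[i][j][k - r] + u[i][j][k + r])
    return acc


def laplacian_naive(u, n):
    """Untiled reference: sweep all interior points in index order."""
    out = zeros(n)
    for i in range(R, n - R):
        for j in range(R, n - R):
            for k in range(R, n - R):
                out[i][j][k] = laplacian_point(u, i, j, k)
    return out


def laplacian_tiled(u, n, tile=(4, 4, 4)):
    """Tiled sweep: visit the interior one tile at a time.

    The tile shape is the tunable parameter; an auto-tuner would
    time several shapes and keep the fastest (hypothetical here).
    """
    out = zeros(n)
    ti, tj, tk = tile
    for i0 in range(R, n - R, ti):
        for j0 in range(R, n - R, tj):
            for k0 in range(R, n - R, tk):
                # Inner loops stay inside one tile; min() clips the
                # last tile at the edge of the interior region.
                for i in range(i0, min(i0 + ti, n - R)):
                    for j in range(j0, min(j0 + tj, n - R)):
                        for k in range(k0, min(k0 + tk, n - R)):
                            out[i][j][k] = laplacian_point(u, i, j, k)
    return out
```

Both sweeps evaluate the same expression at every interior point, so they should produce identical results; only the traversal order differs. In a compiled implementation, that traversal order is what determines how much of the working set stays in cache between neighboring stencil evaluations.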




Acknowledgements

This work was partially supported by NSF PetaApps/EMT/CMMI, DOE SciDAC/SciDAC-e/BES/INCITE, and DTRA. Performance tests were carried out at the Collaboratory for Advanced Computing and Simulations and High Performance Computing Center of the University of Southern California. This research also used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.

Author information

Corresponding author

Correspondence to Hikmet Dursun.

About this article

Cite this article

Dursun, H., Kunaseth, M., Nomura, Ki. et al. Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters. J Supercomput 62, 946–966 (2012). https://doi.org/10.1007/s11227-012-0764-z


  • DOI: https://doi.org/10.1007/s11227-012-0764-z
