
Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures

The Journal of Supercomputing

Abstract

Stencil computations within a single core or across the cores of an SMP node have been studied extensively. However, the demand for higher HPC performance and the rapidly increasing number of cores in modern processors pose new challenges for program developers. These cores are typically organized into several NUMA nodes, characterized by remote memory access across nodes and uniform memory access within each node. In this paper, we conduct experiments on stencil computations on NUMA systems built around two of the most representative processors, ARM and Intel Xeon E5. We leverage a hybrid programming approach that combines MPI and OpenMP to exploit the potential benefits both among NUMA nodes and within a NUMA node. Optimization of the two selected 3D stencil computations involves four levels of parallelism: block decomposition across NUMA nodes, process-level parallelism, thread-level parallelism within a NUMA node, and data-level parallelism within a thread based on SIMD extensions. Experimental results show a maximum speedup of 7.27× over the pure OpenMP implementations on the ARM platform and 11.68× on the Intel platform.
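
To make the four-level scheme concrete, the following C sketch (our illustration, not the authors' code) combines MPI block decomposition with halo exchange, OpenMP threading, and SIMD vectorization. It assumes a 7-point Jacobi stencil, a 1D block decomposition along x, periodic neighbors, and illustrative grid sizes; none of these details are fixed by the abstract.

    /*
     * Minimal sketch of the four-level scheme described in the abstract:
     * one MPI rank per NUMA node exchanges halo planes of a block-
     * decomposed 3D grid, OpenMP threads split the node-local work, and
     * the unit-stride inner loop is vectorized with `omp simd`.
     * Build (typically): mpicc -fopenmp -O3 stencil.c -o stencil
     */
    #include <mpi.h>
    #include <stdlib.h>

    #define NX 128  /* local block extent per rank along the split axis */
    #define NY 128
    #define NZ 128
    #define IDX(i, j, k) (((size_t)(i) * (NY + 2) + (j)) * (NZ + 2) + (k))

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        size_t n = (size_t)(NX + 2) * (NY + 2) * (NZ + 2); /* +2: halos */
        double *u  = calloc(n, sizeof *u);
        double *un = calloc(n, sizeof *un);
        int up = (rank + 1) % nranks, down = (rank - 1 + nranks) % nranks;
        int plane = (NY + 2) * (NZ + 2);  /* one contiguous x-plane */

        for (int step = 0; step < 10; ++step) {
            /* Levels 1-2: block decomposition across NUMA nodes and
               processes, realized as a halo exchange of boundary planes. */
            MPI_Sendrecv(&u[IDX(NX, 0, 0)],     plane, MPI_DOUBLE, up,   0,
                         &u[IDX(0, 0, 0)],      plane, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[IDX(1, 0, 0)],      plane, MPI_DOUBLE, down, 1,
                         &u[IDX(NX + 1, 0, 0)], plane, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Level 3: thread-level parallelism within the NUMA node. */
            #pragma omp parallel for collapse(2) schedule(static)
            for (int i = 1; i <= NX; ++i) {
                for (int j = 1; j <= NY; ++j) {
                    /* Level 4: data-level parallelism via SIMD units. */
                    #pragma omp simd
                    for (int k = 1; k <= NZ; ++k) {
                        un[IDX(i, j, k)] =
                            (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                             u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                             u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]) / 6.0;
                    }
                }
            }
            double *tmp = u; u = un; un = tmp;  /* swap time levels */
        }

        free(u);
        free(un);
        MPI_Finalize();
        return 0;
    }

Pinning one rank per NUMA node (for example, with Open MPI's --map-by numa option) keeps each thread's working set in node-local memory, which is the effect the hybrid MPI+OpenMP approach targets.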




Acknowledgements

This work is supported by the National Key Research and Development Program of China (2018YFB0204301) and the Open Fund of PDL (6142110190201). We would also like to thank Chaorun Liu (NUDT), Peng Zhang (CAEP), Song Liu (XJTU), and the reviewers for their valuable comments on this work.

Author information

Correspondence to Kaifang Zhang.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Zhang, K., Su, H. & Dou, Y. Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures. J Supercomput 77, 13584–13600 (2021). https://doi.org/10.1007/s11227-021-03823-3

