
Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures

The Journal of Supercomputing

Abstract

Stencil computations within a single core or across the cores of an SMP node have been studied extensively. However, the demand for higher HPC performance and the rapidly increasing number of cores in modern processors pose new challenges for program developers. These cores are typically organized into several NUMA nodes, characterized by remote memory access across nodes and uniform memory access within each node. In this paper, we conduct experiments on stencil computations on NUMA systems built around two of the most representative processors, ARM and Intel Xeon E5. We leverage a hybrid programming approach that combines MPI and OpenMP to exploit the potential benefits both among NUMA nodes and within a NUMA node. Optimization of the two selected 3D stencil computations involves four levels of parallelism: block decomposition across NUMA nodes, process-level parallelism, thread-level parallelism within a NUMA node, and data-level parallelism within a thread based on SIMD extensions. Experimental results show a maximum speedup of 7.27× over the pure OpenMP implementations on the ARM platform and 11.68× on the Intel platform.
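
To make the four-level scheme concrete, the following C sketch (our illustration, not the authors' code) combines MPI block decomposition with halo exchange, OpenMP threading, and SIMD vectorization. It assumes a 7-point Jacobi stencil, a 1D block decomposition along x, periodic neighbors, and illustrative grid sizes; none of these details are fixed by the abstract.

    /*
     * Minimal sketch of the four-level scheme described in the abstract:
     * one MPI rank per NUMA node exchanges halo planes of a block-
     * decomposed 3D grid, OpenMP threads split the node-local work, and
     * the unit-stride inner loop is vectorized with `omp simd`.
     * Build (typically): mpicc -fopenmp -O3 stencil.c -o stencil
     */
    #include <mpi.h>
    #include <stdlib.h>

    #define NX 128  /* local block extent per rank along the split axis */
    #define NY 128
    #define NZ 128
    #define IDX(i, j, k) (((size_t)(i) * (NY + 2) + (j)) * (NZ + 2) + (k))

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        size_t n = (size_t)(NX + 2) * (NY + 2) * (NZ + 2); /* +2: halos */
        double *u  = calloc(n, sizeof *u);
        double *un = calloc(n, sizeof *un);
        int up = (rank + 1) % nranks, down = (rank - 1 + nranks) % nranks;
        int plane = (NY + 2) * (NZ + 2);  /* one contiguous x-plane */

        for (int step = 0; step < 10; ++step) {
            /* Levels 1-2: block decomposition across NUMA nodes and
               processes, realized as a halo exchange of boundary planes. */
            MPI_Sendrecv(&u[IDX(NX, 0, 0)],     plane, MPI_DOUBLE, up,   0,
                         &u[IDX(0, 0, 0)],      plane, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[IDX(1, 0, 0)],      plane, MPI_DOUBLE, down, 1,
                         &u[IDX(NX + 1, 0, 0)], plane, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Level 3: thread-level parallelism within the NUMA node. */
            #pragma omp parallel for collapse(2) schedule(static)
            for (int i = 1; i <= NX; ++i) {
                for (int j = 1; j <= NY; ++j) {
                    /* Level 4: data-level parallelism via SIMD units. */
                    #pragma omp simd
                    for (int k = 1; k <= NZ; ++k) {
                        un[IDX(i, j, k)] =
                            (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                             u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                             u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]) / 6.0;
                    }
                }
            }
            double *tmp = u; u = un; un = tmp;  /* swap time levels */
        }

        free(u);
        free(un);
        MPI_Finalize();
        return 0;
    }

Pinning one rank per NUMA node (for example, with Open MPI's --map-by numa option) keeps each thread's working set in node-local memory, which is the effect the hybrid MPI+OpenMP approach targets.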




Acknowledgements

This work is supported by the National Key Research and Development Program of China (2018YFB0204301) and the Open Fund of PDL (6142110190201). We would also like to thank Chaorun Liu (NUDT), Peng Zhang (CAEP), Song Liu (XJTU), and the reviewers for their valuable comments on this work.

Author information

Correspondence to Kaifang Zhang.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Zhang, K., Su, H. & Dou, Y. Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures. J Supercomput 77, 13584–13600 (2021). https://doi.org/10.1007/s11227-021-03823-3

