Stencil Computations on HPC-oriented ARMv8 64-Bit Multi-Core Processor

Li, Chunjiang; Dong, Yushan; Li, Kuan

doi:10.1007/978-3-319-27137-8_3

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9530)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

The ARMv8 64-bit platform has been considered as an alternative for high performance computing (HPC). Stencil computations are a class of iterative kernels that update array elements according to a fixed pattern of neighboring elements, called a stencil. In this paper, we evaluate the performance and scalability of an ARMv8 64-bit multi-core processor with a 7-point 3D stencil code, and devise a series of optimizations for that code. The optimizations focus on parallelizing the kernel and on exploiting data locality with loop tiling; we also improve the calculation of the block size used in tiling. The achieved performance varies with the grid size of the stencil, and the best performance is 24.4% of the peak double-precision (DP) FLOPS, obtained for a grid size of $64^{3}$. Compared with an Intel Xeon processor, the ARMv8 64-bit processor achieves about 40% of the performance of Sandy Bridge for the stencil code with a grid size of $512^{3}$, but it shows better scalability.
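To make the kernel described in the abstract concrete, the following is a minimal sketch of one sweep of a 7-point 3D stencil, first as a plain triple loop and then with loop tiling in the y and z dimensions and an OpenMP parallel loop over the tiles. It is not the authors' code: the function names, the coefficients `alpha` and `beta`, the memory layout, and the tile sizes `ty` and `tz` are assumptions chosen for illustration, and the paper's improved block-size calculation is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Linear index into an n*n*n grid stored contiguously, x varying fastest.
   This layout is an assumption made for the sketch. */
#define IDX(n, x, y, z) ((((size_t)(z)) * (n) + (y)) * (n) + (x))

static int min_int(int a, int b) { return a < b ? a : b; }

/* Update one interior point from its centre and six face neighbours.
   alpha and beta are hypothetical stencil weights. */
static double point7(const double *in, int n, int x, int y, int z,
                     double alpha, double beta)
{
    return alpha * in[IDX(n, x, y, z)]
         + beta * (in[IDX(n, x - 1, y, z)] + in[IDX(n, x + 1, y, z)]
                 + in[IDX(n, x, y - 1, z)] + in[IDX(n, x, y + 1, z)]
                 + in[IDX(n, x, y, z - 1)] + in[IDX(n, x, y, z + 1)]);
}

/* Reference sweep: plain triple loop over interior points. */
void stencil7_naive(const double *in, double *out, int n,
                    double alpha, double beta)
{
    for (int z = 1; z < n - 1; ++z)
        for (int y = 1; y < n - 1; ++y)
            for (int x = 1; x < n - 1; ++x)
                out[IDX(n, x, y, z)] = point7(in, n, x, y, z, alpha, beta);
}

/* Tiled sweep: the y and z loops are split into blocks of ty by tz so that
   the planes touched by one block are reused while still in cache. The x
   loop is left untiled to keep unit-stride access. Tiles write disjoint
   output points, so they can be distributed across threads. */
void stencil7_tiled(const double *in, double *out, int n,
                    double alpha, double beta, int ty, int tz)
{
    assert(ty > 0 && tz > 0);
#ifdef _OPENMP
#pragma omp parallel for collapse(2) schedule(static)
#endif
    for (int zz = 1; zz < n - 1; zz += tz)
        for (int yy = 1; yy < n - 1; yy += ty) {
            int zend = min_int(zz + tz, n - 1);
            int yend = min_int(yy + ty, n - 1);
            for (int z = zz; z < zend; ++z)
                for (int y = yy; y < yend; ++y)
                    for (int x = 1; x < n - 1; ++x)
                        out[IDX(n, x, y, z)] =
                            point7(in, n, x, y, z, alpha, beta);
        }
}
```

An iterative solver would call one of these sweeps repeatedly, swapping the `in` and `out` buffers between iterations. Suitable values for `ty` and `tz` depend on the cache sizes of the target processor; the values here are left to the caller.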

Acknowledgements

The work in this paper is partially supported by the project of National Science Foundation of China under grant No. 61170046, and the National High Technology Research and Development Program of China (863 Program) No. 2012AA010903.

Author information

Authors and Affiliations

School of Computer, National University of Defence Technology, Changsha, 410073, China
Chunjiang Li, Yushan Dong & Kuan Li

Corresponding author

Correspondence to Chunjiang Li.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

Cite this paper

Li, C., Dong, Y., Li, K. (2015). Stencil Computations on HPC-oriented ARMv8 64-Bit Multi-Core Processor. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_3

DOI: https://doi.org/10.1007/978-3-319-27137-8_3
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27136-1
Online ISBN: 978-3-319-27137-8
eBook Packages: Computer Science, Computer Science (R0)
