Stencil Computations on HPC-oriented ARMv8 64-Bit Multi-Core Processor

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9530)

Abstract

The ARMv8 64-bit platform has been considered as an alternative for high performance computing (HPC). Stencil computations are a class of iterative kernels that update array elements according to a fixed pattern, called a stencil. In this paper, we evaluate the performance and scalability of an ARMv8 64-bit multi-core processor with a 7-point 3D stencil code, and we devise a series of optimizations for that code. The optimizations focus mainly on parallelizing the kernel and on exploiting data locality through loop tiling; we also improve the calculation of the block size used in tiling. The achieved performance varies with the grid size of the stencil, and the best result is 24.4 % of the peak double-precision (DP) Flops, obtained for a grid size of \(64^{3}\). Compared with an Intel Xeon processor, the ARMv8 64-bit processor reaches about 40 % of the performance of Sandy Bridge for the stencil code with a grid size of \(512^{3}\), but it shows better scalability.
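As a rough illustration of the kind of kernel the abstract describes, the following C sketch performs one sweep of a 7-point 3D stencil with the y and z loops tiled and the tiles distributed across threads with OpenMP. It is not the authors' code: the function name `stencil7_tiled`, the helper `idx`, the coefficients `c0` and `c1`, and the tile sizes `by` and `bz` are all assumptions made for this example, and the paper's improved block-size calculation is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index into an n*n*n grid stored with x as the fastest dimension. */
static size_t idx(int n, int x, int y, int z) {
    return ((size_t)z * (size_t)n + (size_t)y) * (size_t)n + (size_t)x;
}

static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * One out-of-place sweep of a 7-point stencil over the interior points.
 * c0 weights the centre point and c1 weights each of the six face
 * neighbours (hypothetical coefficients). by and bz are the tile sizes
 * in y and z; the x loop is left untiled so inner accesses stay unit-stride.
 */
void stencil7_tiled(const double *in, double *out, int n,
                    double c0, double c1, int by, int bz) {
    assert(n >= 3 && by > 0 && bz > 0);

    /* Each (zz, yy) tile is an independent unit of work for a thread.
       Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int zz = 1; zz < n - 1; zz += bz) {
        for (int yy = 1; yy < n - 1; yy += by) {
            const int zend = min_int(zz + bz, n - 1);
            const int yend = min_int(yy + by, n - 1);
            for (int z = zz; z < zend; ++z) {
                for (int y = yy; y < yend; ++y) {
                    for (int x = 1; x < n - 1; ++x) {
                        out[idx(n, x, y, z)] =
                            c0 * in[idx(n, x, y, z)] +
                            c1 * (in[idx(n, x - 1, y, z)] + in[idx(n, x + 1, y, z)] +
                                  in[idx(n, x, y - 1, z)] + in[idx(n, x, y + 1, z)] +
                                  in[idx(n, x, y, z - 1)] + in[idx(n, x, y, z + 1)]);
                    }
                }
            }
        }
    }
}
```

Tiling only changes the order in which points are visited, so any positive `by` and `bz` give the same result as the untiled loop; which tile sizes perform best depends on the cache hierarchy and the grid size, which is the tuning question the paper addresses.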

Acknowledgements

The work in this paper is partially supported by the project of National Science Foundation of China under grant No. 61170046, and the National High Technology Research and Development Program of China (863 Program) No. 2012AA010903.

Author information

Corresponding author

Correspondence to Chunjiang Li.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, C., Dong, Y., Li, K. (2015). Stencil Computations on HPC-oriented ARMv8 64-Bit Multi-Core Processor. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_3

  • DOI: https://doi.org/10.1007/978-3-319-27137-8_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27136-1

  • Online ISBN: 978-3-319-27137-8

  • eBook Packages: Computer Science, Computer Science (R0)