research-article

On the Performance Portability of OpenACC, OpenMP, Kokkos and RAJA

Author:
Ami Marowka

Parallel Research Labs, Israel

HPCAsia '22: International Conference on High Performance Computing in Asia-Pacific Region, January 2022, Pages 103–114
https://doi.org/10.1145/3492805.3492806

Published: 07 January 2022

ABSTRACT

Performance portability frameworks are becoming more central and essential in heterogeneous computing systems. However, the developer toolbox lacks tools for assessing the degree of performance portability that these frameworks achieve.

This article presents a new definition and a metric for evaluating the performance portability of high-level parallel programming models. Using the new metric, the performance portability of OpenACC, OpenMP, Kokkos and RAJA was evaluated based on 324 case studies spanning various application domains, CPU and GPU architectures, and high-performance compilers. The results show that the four performance portability frameworks achieve impressive performance portability of over 80%, with no significant differences between different architectures and compilers.

Published in

HPCAsia '22: International Conference on High Performance Computing in Asia-Pacific Region
January 2022
145 pages
ISBN: 9781450384988
DOI: 10.1145/3492805

Copyright © 2022 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Kokkos
OpenACC
OpenMP
Performance Efficiency
Performance Portability
RAJA
Acceptance Rates
Overall Acceptance Rate: 69 of 143 submissions, 48%