ABSTRACT
Performance portability frameworks are becoming increasingly central to heterogeneous computing systems. However, developers lack tools for assessing the degree of performance portability that these frameworks deliver.
This article presents a new definition of, and a metric for, the performance portability of high-level parallel programming models. Using the new metric, the performance portability of OpenACC, OpenMP, Kokkos, and RAJA was evaluated on 324 case studies spanning various application domains, CPU and GPU architectures, and high-performance compilers. The results show that the four frameworks achieve an impressive performance portability of over 80%, with no significant differences across architectures and compilers.
Index Terms
- On the Performance Portability of OpenACC, OpenMP, Kokkos and RAJA