Abstract
Compiled low-level languages, such as C/C++ and Fortran, have traditionally been employed to implement applications that exploit GPU devices. As a counterpoint to that trend, this paper presents a performance and programming effort analysis of Python, an interpreted high-level language, which was applied to develop the kernels and applications of the NAS Parallel Benchmarks (NPB) targeting GPUs. We used the Numba environment to enable CUDA support in Python, a tool that allowed us to implement the GPU programs in pure Python code. Our experimental results showed that the Python applications reached performance similar to that of C++ programs employing CUDA, and better than C++ using OpenACC, for most NPB benchmarks. Furthermore, the Python codes demanded fewer operations related to the GPU framework than CUDA, mainly because Python needs fewer statements to manage memory allocations and data transfers. Nevertheless, our Python implementations still required more such operations than the OpenACC ones.





Availability of data and materials
The source code of the NPB implemented with Python is available at: https://github.com/danidomenico/NPB-PYTHON (archived at: https://zenodo.org/badge/latestdoi/444602145)
The statistical analysis report for the performance experiments is available at: https://github.com/danidomenico/NPB-PYTHON-stat-analysis (archived at: https://zenodo.org/badge/latestdoi/419926058)
Notes
NPB-PYTHON: https://github.com/danidomenico/NPB-PYTHON.
Available at https://github.com/GMAP/NPB-CPP.
Available at https://github.com/GMAP/NPB-GPU.
Statistical analysis report: https://github.com/danidomenico/NPB-PYTHON-stat-analysis.
Environment to reproduce experiments: https://github.com/danidomenico/NPB-PYTHON/tree/master/reproducibility.
References
CUDA C++ Programming Guide: Version 11.2.1. Nvidia (2021)
The OpenCL Specification: Version 2.2. Khronos Working Group (2019)
OpenACC Specification: Version 3.1. OpenACC.org (2020)
SYCL 2020 Reference Guide: Revision 2. Khronos Working Group (2022)
Holm HH, Brodtkorb AR, Sætra ML (2020) GPU computing with Python: performance, energy efficiency and usability. Computation 8(1). https://doi.org/10.3390/computation8010004
Ziogas AN, Ben-Nun T, Schneider T, Hoefler T (2021) NPBench: a benchmarking suite for high-performance NumPy. In: Proceedings of the ACM international conference on supercomputing. ICS'21. ACM, New York, NY, USA, pp 63–74. https://doi.org/10.1145/3447818.3460360
Oden L (2020) Lessons learned from comparing C-CUDA and Python-Numba for GPU-Computing. In: 2020 28th Euromicro international conference on parallel, distributed and network-based processing (PDP), pp 216–223 https://doi.org/10.1109/PDP50117.2020.00041
Numba Documentation: Version 0.50. Anaconda, Inc. and others (2021)
CuPy API Reference: Version 11.4. Preferred Infrastructure, Inc. and Preferred Networks, Inc. (2021)
Klöckner A, Pinto N, Lee Y, Catanzaro B, Ivanov P, Fasih A (2012) PyCUDA and PyOpenCL: a scripting-based approach to GPU run-time code generation. Parallel Comput 38(3):157–174. https://doi.org/10.1016/j.parco.2011.09.001
Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Fatoohi RA, Frederickson PO, Lasinski TA, Simon HD, Venkatakrishnan V, Weeratunga SK (1994) The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Advanced Supercomputing Division
Di Domenico D, Cavalheiro GGH, Lima JVF (2022) NAS Parallel Benchmark kernels with Python: a performance and programming effort analysis focusing on GPUs. In: 2022 30th Euromicro international conference on parallel, distributed and network-based processing (PDP), pp 26–33. https://doi.org/10.1109/PDP55904.2022.00013
Araujo GAd, Griebler D, Danelutto M, Fernandes LG (2020) Efficient NAS Parallel Benchmark Kernels with CUDA. In: 2020 28th Euromicro international conference on parallel, distributed and network-based processing (PDP), pp 9–16. https://doi.org/10.1109/PDP50117.2020.00009
Araujo G, Griebler D, Rockenbach DA, Danelutto M, Fernandes LG (2021) NAS Parallel Benchmarks with CUDA and beyond. Software: Practice and Experience, 1–28. https://doi.org/10.1002/spe.3056
Xu R, Tian X, Chandrasekaran S, Yan Y, Chapman B (2015) NAS Parallel Benchmarks for GPGPUs using a directive-based programming model. In: Brodman J, Tu P (eds) Lang Compil Parallel Comput. Springer, Cham, pp 67–81
Behnel S, Bradshaw RW, Seljebotn DS (2009) Cython tutorial. In: Varoquaux G, van der Walt S, Millman J (eds) Proceedings of the 8th python in science conference, Pasadena, CA USA pp 4–14
NumPy Documentation: Version 1.21. The NumPy community (2021)
Löff J, Griebler D, Mencagli G, Araujo G, Torquati M, Danelutto M, Fernandes LG (2021) The NAS Parallel Benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures. Futur Gener Comput Syst 125:743–757. https://doi.org/10.1016/j.future.2021.07.021
Fenton NE, Bieman J (2014) Software metrics: a rigorous and practical approach, 3rd edn. Chapman & Hall/CRC innovations in software engineering and software development series. CRC Press, Boca Raton
Malik M, Li T, Sharif U, Shahid R, El-Ghazawi TA, Newby GB (2012) Productivity of GPUs under different programming paradigms. Concurr Comput Pract Exp 24:179–191
Christgau S, Spazier J, Schnor B, Hammitzsch M, Babeyko A, Waechter J (2014) A comparison of CUDA and OpenACC: accelerating the tsunami simulation EasyWave. In: ARCS 2014; 2014 workshop proceedings on architecture of computing systems, pp 1–5
Memeti S, Li L, Pllana S, Kołodziej J, Kessler C (2017) Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption. In: Proceedings of the 2017 workshop on adaptive resource management and scheduling for cloud computing. ARMS-CC’17, pp 1–6. ACM, New York, NY. https://doi.org/10.1145/3110355.3110356
Hoshino T, Maruyama N, Matsuoka S, Takaki R (2013) CUDA vs OpenACC: Performance case studies with kernel benchmarks and a memory-bound CFD application. In: 2013 13th IEEE/ACM international symposium on cluster, cloud, and grid computing, pp 136–143 https://doi.org/10.1109/CCGrid.2013.12
Gimenes TL, Pisani F, Borin E (2018) Evaluating the performance and cost of accelerating seismic processing with CUDA, OpenCL, OpenACC, and OpenMP. In: 2018 IEEE international parallel and distributed processing symposium (IPDPS), pp 399–408. https://doi.org/10.1109/IPDPS.2018.00050
Lima JVF, Di Domenico D (2019) HPSM: a programming framework to exploit multi-CPU and multi-GPU systems simultaneously. Int J Grid Util Comput 10:201. https://doi.org/10.1504/IJGUC.2019.099686
Li L, Kessler C (2017) VectorPU: a generic and efficient data-container and component model for transparent data transfer on GPU-based heterogeneous systems. In: Proceedings of the 8th and 6th workshop on parallel programming and run-time management techniques for many-core architectures and design tools and architectures for multicore embedded comp. Platforms. PARMA-DITAM ’17. ACM, NY, NY, USA, pp 7–12. https://doi.org/10.1145/3029580.3029582
Gong C, Liu J, Qin J, Hu Q, Gong Z (2010) Efficient embarrassingly parallel on graphics processor unit. In: 2010 2nd international conference on education technology and computer, vol 4, pp V4-400–V4-404. https://doi.org/10.1109/ICETC.2010.5529656
Jin H, Kellogg M, Mehrotra P (2012) Using compiler directives for accelerating CFD applications on GPUs. In: Chapman BM, Massaioli F, Müller MS, Rorro M (eds) OpenMP in a heterogeneous world. Springer, Berlin, pp 154–168
Seo S, Jo G, Lee J (2011) Performance characterization of the NAS parallel benchmarks in OpenCL. In: 2011 IEEE international symposium on workload characterization (IISWC), pp 137–148. https://doi.org/10.1109/IISWC.2011.6114174
Li X, Shih P-C, Overbey J, Seals C, Lim A (2016) Comparing programmer productivity in OpenACC and CUDA: an empirical investigation. Int J Comput Sci Eng Appl 6:1–15
Kuan L, Neves J, Pratas F, Tomás P, Sousa L (2014) Accelerating phylogenetic inference on GPUs: an OpenACC and CUDA comparison. In: Rojas I, Guzman FMO (eds) International work-conference on bioinformatics and biomedical engineering, IWBBIO 2014, Granada, Spain, April 7–9, 2014, pp 589–600
Guo X, Wu J, Wu Z, Huang B (2016) Parallel computation of aerial target reflection of background infrared radiation: performance comparison of OpenMP, OpenACC, and CUDA implementations. IEEE J Sel Topics Appl Earth Observ Remote Sens 9(4):1653–1662. https://doi.org/10.1109/JSTARS.2016.2516503
Marowka A (2018) Python accelerators for high-performance computing. J Supercomput 74:1449–1460
Dogaru R, Dogaru I (2015) A low cost high performance computing platform for cellular nonlinear networks using Python for CUDA. In: 2015 20th international conference on control systems and computer science, pp 593–598 https://doi.org/10.1109/CSCS.2015.36
Acknowledgements
This research received funding from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brazil (CAPES)—Finance Code 001, from UFSM/FATEC through Project Number 041250-9.07.0025 (100548), and from the project "GREEN-CLOUD" (#16/2551-0000 488-9) of FAPERGS and CNPq Brazil, program PRONEX 12/2014. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the two NVIDIA Titan V GPUs used in the experiments for this research.
Author information
Authors and Affiliations
Contributions
DDD wrote the manuscript text, prepared the figures and tables, carried out the implementation of the applications, and executed the experiments. JVFL and GGHC contributed to the analysis of the experimental results, regarding both performance and programming effort. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Di Domenico, D., Lima, J.V.F. & Cavalheiro, G.G.H. NAS Parallel Benchmarks with Python: a performance and programming effort analysis focusing on GPUs. J Supercomput 79, 8890–8911 (2023). https://doi.org/10.1007/s11227-022-04932-3