Abstract
Compiled low-level languages, such as C/C++ and Fortran, have traditionally been employed to implement applications that exploit GPU devices. As a counterpoint to that trend, this paper presents a performance and programming effort analysis of Python, an interpreted high-level language, which was applied to develop the kernels and applications of the NAS Parallel Benchmarks (NPB) targeting GPUs. We used the Numba environment to enable CUDA support in Python, a tool that allowed us to implement the GPU programs in pure Python code. Our experimental results showed that the Python applications reached performance similar to that of C++ programs employing CUDA, and better than C++ using OpenACC, for most NPB benchmarks. Furthermore, the Python codes demanded fewer operations related to the GPU framework than CUDA, mainly because Python needs fewer statements to manage memory allocations and data transfers. Nevertheless, our Python implementations still required more such operations than the OpenACC ones.





Availability of data and materials
The source code of the NPB implemented with Python is available at: https://github.com/danidomenico/NPB-PYTHON (archived at: https://zenodo.org/badge/latestdoi/444602145)
The statistical analysis report for the performance experiments is available at: https://github.com/danidomenico/NPB-PYTHON-stat-analysis (archived at: https://zenodo.org/badge/latestdoi/419926058)
Notes
NPB-PYTHON: https://github.com/danidomenico/NPB-PYTHON.
Available at https://github.com/GMAP/NPB-CPP.
Available at https://github.com/GMAP/NPB-GPU.
Statistical analysis report: https://github.com/danidomenico/NPB-PYTHON-stat-analysis.
Environment to reproduce experiments: https://github.com/danidomenico/NPB-PYTHON/tree/master/reproducibility.
References
CUDA C++ Programming Guide: Version 11.2.1. Nvidia (2021)
The OpenCL Specification: Version 2.2. Khronos Working Group (2019)
OpenACC Specification: Version 3.1. OpenACC.org (2020)
SYCL 2020 Reference Guide: Revision 2. Khronos Working Group (2022)
Holm HH, Brodtkorb AR, Sætra ML (2020) GPU computing with Python: performance, energy efficiency and usability. Computation 8(1). https://doi.org/10.3390/computation8010004
Ziogas AN, Ben-Nun T, Schneider T, Hoefler T (2021) NPBench: a benchmarking suite for high-performance NumPy. In: Proceedings of the ACM international conference on supercomputing. ICS'21. ACM, New York, NY, USA, pp 63–74. https://doi.org/10.1145/3447818.3460360
Oden L (2020) Lessons learned from comparing C-CUDA and Python-Numba for GPU-Computing. In: 2020 28th Euromicro international conference on parallel, distributed and network-based processing (PDP), pp 216–223 https://doi.org/10.1109/PDP50117.2020.00041
Numba Documentation: Version 0.50. Anaconda, Inc. and others (2021)
CuPy API Reference: Version 11.4. Preferred Infrastructure, Inc. and Preferred Networks, Inc. (2021)
Klöckner A, Pinto N, Lee Y, Catanzaro B, Ivanov P, Fasih A (2012) PyCUDA and PyOpenCL: a scripting-based approach to GPU run-time code generation. Parallel Comput 38(3):157–174. https://doi.org/10.1016/j.parco.2011.09.001
Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Fatoohi RA, Frederickson PO, Lasinski TA, Simon HD, Venkatakrishnan V, Weeratunga SK (1994) The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Advanced Supercomputing Division
Di Domenico D, Cavalheiro GGH, Lima JVF (2022) NAS Parallel Benchmark kernels with Python: a performance and programming effort analysis focusing on GPUs. In: 2022 30th Euromicro international conference on parallel, distributed and network-based processing (PDP), pp 26–33. https://doi.org/10.1109/PDP55904.2022.00013
Araujo GAd, Griebler D, Danelutto M, Fernandes LG (2020) Efficient NAS Parallel Benchmark Kernels with CUDA. In: 2020 28th Euromicro international conference on parallel, distributed and network-based processing (PDP), pp 9–16. https://doi.org/10.1109/PDP50117.2020.00009
Araujo G, Griebler D, Rockenbach DA, Danelutto M, Fernandes LG (2021) NAS Parallel Benchmarks with CUDA and beyond. Software: Practice and Experience, 1–28. https://doi.org/10.1002/spe.3056
Xu R, Tian X, Chandrasekaran S, Yan Y, Chapman B (2015) NAS Parallel Benchmarks for GPGPUs using a directive-based programming model. In: Brodman J, Tu P (eds) Lang Compil Parallel Comput. Springer, Cham, pp 67–81
Behnel S, Bradshaw RW, Seljebotn DS (2009) Cython tutorial. In: Varoquaux G, van der Walt S, Millman J (eds) Proceedings of the 8th python in science conference, Pasadena, CA USA pp 4–14
NumPy Documentation: Version 1.21. The NumPy community (2021)
Löff J, Griebler D, Mencagli G, Araujo G, Torquati M, Danelutto M, Fernandes LG (2021) The NAS Parallel Benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures. Futur Gener Comput Syst 125:743–757. https://doi.org/10.1016/j.future.2021.07.021
Fenton NE, Bieman J (2014) Software metrics: a rigorous and practical approach, 3rd edn. Chapman & Hall/CRC innovations in software engineering and software development series. CRC Press, Boca Raton
Malik M, Li T, Sharif U, Shahid R, El-Ghazawi TA, Newby GB (2012) Productivity of GPUs under different programming paradigms. Concurr Comput Pract Exp 24:179–191
Christgau S, Spazier J, Schnor B, Hammitzsch M, Babeyko A, Waechter J (2014) A comparison of CUDA and OpenACC: accelerating the tsunami simulation EasyWave. In: ARCS 2014; 2014 workshop proceedings on architecture of computing systems, pp 1–5
Memeti S, Li L, Pllana S, Kołodziej J, Kessler C (2017) Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption. In: Proceedings of the 2017 workshop on adaptive resource management and scheduling for cloud computing. ARMS-CC’17, pp 1–6. ACM, New York, NY. https://doi.org/10.1145/3110355.3110356
Hoshino T, Maruyama N, Matsuoka S, Takaki R (2013) CUDA vs OpenACC: Performance case studies with kernel benchmarks and a memory-bound CFD application. In: 2013 13th IEEE/ACM international symposium on cluster, cloud, and grid computing, pp 136–143 https://doi.org/10.1109/CCGrid.2013.12
Gimenes TL, Pisani F, Borin E (2018) Evaluating the performance and cost of accelerating seismic processing with CUDA, OpenCL, OpenACC, and OpenMP. In: 2018 IEEE international parallel and distributed processing symposium (IPDPS), pp 399–408. https://doi.org/10.1109/IPDPS.2018.00050
Lima JVF, Di Domenico D (2019) HPSM: a programming framework to exploit multi-CPU and multi-GPU systems simultaneously. Int J Grid Util Comput 10:201. https://doi.org/10.1504/IJGUC.2019.099686
Li L, Kessler C (2017) VectorPU: a generic and efficient data-container and component model for transparent data transfer on GPU-based heterogeneous systems. In: Proceedings of the 8th and 6th workshop on parallel programming and run-time management techniques for many-core architectures and design tools and architectures for multicore embedded comp. Platforms. PARMA-DITAM ’17. ACM, NY, NY, USA, pp 7–12. https://doi.org/10.1145/3029580.3029582
Gong C, Liu J, Qin J, Hu Q, Gong Z (2010) Efficient embarrassingly parallel on graphics processor unit. In: 2010 2nd international conference on education technology and computer, vol 4, pp V4-400–V4-404. https://doi.org/10.1109/ICETC.2010.5529656
Jin H, Kellogg M, Mehrotra P (2012) Using compiler directives for accelerating CFD applications on GPUs. In: Chapman BM, Massaioli F, Müller MS, Rorro M (eds) OpenMP in a heterogeneous world. Springer, Berlin, pp 154–168
Seo S, Jo G, Lee J (2011) Performance characterization of the NAS parallel benchmarks in OpenCL. In: 2011 IEEE international symposium on workload characterization (IISWC), pp 137–148. https://doi.org/10.1109/IISWC.2011.6114174
Li X, Shih P-C, Overbey J, Seals C, Lim A (2016) Comparing programmer productivity in OpenACC and CUDA: an empirical investigation. Int J Comput Sci Eng Appl 6:1–15
Kuan L, Neves J, Pratas F, Tomás P, Sousa L (2014) Accelerating phylogenetic inference on GPUs: an OpenACC and CUDA comparison. In: Rojas I, Guzman FMO (eds) International work-conference on bioinformatics and biomedical engineering, IWBBIO 2014, Granada, Spain, April 7–9, 2014, pp 589–600
Guo X, Wu J, Wu Z, Huang B (2016) Parallel computation of aerial target reflection of background infrared radiation: performance comparison of OpenMP, OpenACC, and CUDA implementations. IEEE J Sel Topics Appl Earth Observ Remote Sens 9(4):1653–1662. https://doi.org/10.1109/JSTARS.2016.2516503
Marowka A (2018) Python accelerators for high-performance computing. J Supercomput 74:1449–1460
Dogaru R, Dogaru I (2015) A low cost high performance computing platform for cellular nonlinear networks using Python for CUDA. In: 2015 20th international conference on control systems and computer science, pp 593–598 https://doi.org/10.1109/CSCS.2015.36
Acknowledgements
This research received funding from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brazil (CAPES)—Finance Code 001, from UFSM/FATEC through Project Number 041250-9.07.0025 (100548), and from the project "GREEN-CLOUD" (#16/2551-0000 488-9) of FAPERGS and CNPq Brazil, program PRONEX 12/2014. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the two NVIDIA Titan V GPUs used in the experiments for this research.
Author information
Authors and Affiliations
Contributions
DDD wrote the manuscript text, prepared the figures and tables, carried out the implementation of the applications, and executed the experiments. JVFL and GGHC contributed to the analysis of the experimental results, regarding both performance and programming effort. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Di Domenico, D., Lima, J.V.F. & Cavalheiro, G.G.H. NAS Parallel Benchmarks with Python: a performance and programming effort analysis focusing on GPUs. J Supercomput 79, 8890–8911 (2023). https://doi.org/10.1007/s11227-022-04932-3