
NAS Parallel Benchmarks with Python: a performance and programming effort analysis focusing on GPUs

Published in The Journal of Supercomputing

Abstract

Compiled low-level languages, such as C/C++ and Fortran, have traditionally been the programming tools of choice for implementing applications that exploit GPU devices. As a counterpoint to that trend, this paper presents a performance and programming effort analysis of Python, an interpreted, high-level language, applied to develop the kernels and applications of the NAS Parallel Benchmarks (NPB) targeting GPUs. We used the Numba environment to enable CUDA support in Python, a tool that allows GPU programs to be written in pure Python code. Our experimental results show that the Python applications achieved performance similar to C++ programs using CUDA and better than C++ using OpenACC for most NPB benchmarks. Furthermore, the Python codes required fewer operations related to the GPU framework than CUDA, mainly because Python needs fewer statements to manage memory allocations and data transfers. Nevertheless, our Python implementations required more such operations than the OpenACC ones.
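For readers unfamiliar with Numba's CUDA support, the following sketch illustrates the programming style discussed above: a GPU kernel written in pure Python and launched with an explicit grid configuration. It is a minimal example under assumed names and parameters (the vector_add kernel, the problem size, and the launch configuration are illustrative), not code taken from the NPB-PYTHON implementation.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)          # absolute index of this thread in the 1D grid
        if i < out.shape[0]:      # guard threads beyond the array bounds
            out[i] = a[i] + b[i]

    n = 1_000_000
    a = np.random.rand(n)
    b = np.random.rand(n)
    out = np.zeros_like(a)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block

    # Passing plain NumPy arrays lets Numba handle the host-to-device and
    # device-to-host transfers itself, one reason fewer framework statements
    # are needed than in CUDA C++.
    vector_add[blocks, threads_per_block](a, b, out)

    # Explicit transfers remain available when finer control is desired:
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_out = cuda.device_array_like(out)
    vector_add[blocks, threads_per_block](d_a, d_b, d_out)
    out = d_out.copy_to_host()

The contrast between the implicit launch and the explicit cuda.to_device/copy_to_host calls reflects the kind of memory allocation and data transfer statements the abstract refers to.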


Availability of data and materials

The source code of the NPB implemented with Python is available at: https://github.com/danidomenico/NPB-PYTHON (archived at: https://zenodo.org/badge/latestdoi/444602145)

The statistical analysis report regarding performance experiments is available at: https://github.com/danidomenico/NPB-PYTHON-stat-analysis (archived at: https://zenodo.org/badge/latestdoi/419926058)

Notes

  1. NPB-PYTHON: https://github.com/danidomenico/NPB-PYTHON.

  2. Available at https://github.com/GMAP/NPB-CPP.

  3. Available at https://github.com/GMAP/NPB-GPU.

  4. Statistical analysis report: https://github.com/danidomenico/NPB-PYTHON-stat-analysis.

  5. Environment to reproduce experiments: https://github.com/danidomenico/NPB-PYTHON/tree/master/reproducibility.


Acknowledgements

This research received funding from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES), Finance Code 001; from UFSM/FATEC through Project Number 041250-9.07.0025 (100548); and from the project "GREEN-CLOUD" (#16/2551-0000 488-9), supported by FAPERGS and CNPq Brazil under program PRONEX 12/2014. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the two NVIDIA Titan V GPUs used in the experiments for this research.

Author information


Contributions

DDD wrote the manuscript text, prepared the figures and tables, implemented the applications, and executed the experiments. JVFL and GGHC contributed to the analysis of the experimental results, regarding both performance and programming effort. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Daniel Di Domenico or João V. F. Lima.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below are the links to the electronic supplementary material.

Supplementary file1 (CU 1 KB)

Supplementary file2 (PY 1 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Di Domenico, D., Lima, J.V.F. & Cavalheiro, G.G.H. NAS Parallel Benchmarks with Python: a performance and programming effort analysis focusing on GPUs. J Supercomput 79, 8890–8911 (2023). https://doi.org/10.1007/s11227-022-04932-3
