Performance Evaluation of Python Based Data Analytics Frameworks in Summit: Early Experiences

  • Conference paper

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1315))

Abstract

The explosion in the volume of data generated by ever-larger simulation campaigns, experiments, and observations necessitates competent tools for data wrangling and analysis. While the Oak Ridge Leadership Computing Facility (OLCF) provides a variety of tools for data wrangling and data analysis tasks, Python-based tools often lack scalability or the ability to fully exploit the computational capability of OLCF’s Summit supercomputer. NVIDIA RAPIDS and Dask offer a promising solution to accelerate and distribute data analytics workloads from personal computers to heterogeneous supercomputing systems. We discuss early performance evaluation results of RAPIDS and Dask on Summit to understand their capabilities, scalability, and limitations. Our evaluation includes a subset of RAPIDS libraries, i.e., cuDF, cuML, and cuGraph, and Chainer’s CuPy, and their multi-GPU variants when available. We also draw on the observed trends from the performance evaluation results to discuss best practices for maximizing performance.

B. Hernández et al.—Contributed Equally.

This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Notes

  1. In RAPIDS 0.14, multi-GPU/multi-node support was provided by Dask-cuDF. In newer versions, scale-out support has been added to the cuDF GitHub repository, and the Dask-cuDF repository has been archived.

Acknowledgements

This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Author information

Corresponding author

Correspondence to Benjamín Hernández.

Copyright information

© 2020 This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply

About this paper

Cite this paper

Hernández, B. et al. (2020). Performance Evaluation of Python Based Data Analytics Frameworks in Summit: Early Experiences. In: Nichols, J., Verastegui, B., Maccabe, A., Hernandez, O., Parete-Koon, S., Ahearn, T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI. SMC 2020. Communications in Computer and Information Science, vol 1315. Springer, Cham. https://doi.org/10.1007/978-3-030-63393-6_24

  • DOI: https://doi.org/10.1007/978-3-030-63393-6_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63392-9

  • Online ISBN: 978-3-030-63393-6

  • eBook Packages: Computer Science, Computer Science (R0)
