
Understanding Distributed Deep Learning Performance by Correlating HPC and Machine Learning Measurements

  • Conference paper
High Performance Computing (ISC High Performance 2022)

Abstract

Frameworks for Distributed Deep Learning (DDL) have become popular alternatives for distributing training by adding only a few lines of code to a single-node script. From a High-Performance Computing (HPC) perspective, the profiling tools traditionally used by Machine Learning (ML) researchers fail to expose details about distributed training performance, such as synchronization points, communication and computation time, and device usage throughout training. Moreover, these results are usually analyzed independently. We present a performance analysis methodology for DDL frameworks that combines HPC and ML tools, applying both intrusive and non-intrusive tracing, to enrich the findings of a strong-scaling evaluation on three clusters with different GPU models. We selected two modern DDL frameworks: Horovod and Tarantella. Using spatial and temporal analysis, we identify bottlenecks in the frameworks, such as a long initialization time in Horovod and the lack of data distribution during the testing phase in Tarantella. We extract performance measurements using temporal aggregation over the training phases, which can help DDL framework developers improve their tools. Horovod presented the best scaling efficiency for 4 GPUs or more, reaching up to 84.6% with 4 GPUs and a large batch size, while Tarantella achieved 54.7% in the same case. Using our temporal aggregation approach, we identified that this result originates from Horovod processing an epoch faster than Tarantella.
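As a concrete illustration of the "few lines of code" needed to distribute a single-node script, the sketch below shows the typical Horovod pattern with the Keras API. This is a generic example assembled for this summary, not the authors' benchmark code: the model, dataset, and hyperparameters are placeholders.

    # Minimal Horovod + Keras sketch of the "few lines of code" pattern.
    # Generic example; the model, dataset, and hyperparameters are placeholders,
    # not the configuration evaluated in the paper.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # initialize the Horovod communication layer

    # Pin each worker process to a single GPU based on its local rank.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])

    # Scale the learning rate by the number of workers and wrap the optimizer
    # so gradients are averaged across workers (allreduce) at every step.
    opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)

    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=opt,
        metrics=['accuracy'],
    )

    callbacks = [
        # Broadcast rank 0's initial weights so all workers start identically.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]

    # Only rank 0 prints progress; launch with, e.g., `horovodrun -np 4 python train.py`.
    model.fit(x_train, y_train, epochs=5, batch_size=256,
              callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

For the scaling-efficiency figures, assuming the usual strong-scaling definition efficiency(N) = T(1) / (N x T(N)), Horovod's 84.6% on 4 GPUs corresponds to a speedup of roughly 3.4x over a single GPU, while Tarantella's 54.7% corresponds to roughly 2.2x.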


Notes

  1. https://github.com/cc-hpc-itwm/GPI-2
  2. https://cran.r-project.org/web/packages/tidyverse/index.html
  3. https://github.com/bsc-performance-tools/extrae/tree/GASPI
  4. https://db.rstudio.com/databases/sqlite/
  5. https://scorepci.pages.jsc.fz-juelich.de/scorep-pipelines/docs/scorep-6.0/html/scorepwrapper.html
  6. https://github.com/schnorr/otf2utils


Acknowledgments

We are thankful to the Tarantella, scorep-binding-python, and Score-P developers for their prompt replies, which supported our progress. This work was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, under grant no. 88887.481194/2020-00. The experiments were executed on the PCAD at the Federal University of Rio Grande do Sul and on Grid'5000, supported by Inria, CNRS, RENATER and other organizations.

Author information

Correspondence to Ana Luisa Veroneze Solórzano or Lucas Mello Schnorr.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Veroneze Solórzano, A.L., Mello Schnorr, L. (2022). Understanding Distributed Deep Learning Performance by Correlating HPC and Machine Learning Measurements. In: Varbanescu, A.L., Bhatele, A., Luszczek, P., Baboulin, M. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_14


  • DOI: https://doi.org/10.1007/978-3-031-07312-0_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07311-3

  • Online ISBN: 978-3-031-07312-0

