Understanding Distributed Deep Learning Performance by Correlating HPC and Machine Learning Measurements

Veroneze Solórzano, Ana Luisa; Mello Schnorr, Lucas

doi:10.1007/978-3-031-07312-0_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13289)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Frameworks for Distributed Deep Learning (DDL) have become popular alternatives for distributing training by adding a few lines of code to a single-node script. From a High-Performance Computing (HPC) perspective, the profiling tools traditionally used in Machine Learning (ML) research fail to expose details about distributed training performance, such as synchronization points, communication and computing time, and device usage throughout the training. Moreover, these results are usually considered independently. We present a methodology for performance analysis of DDL frameworks that combines HPC and ML tools, applying intrusive and non-intrusive tracing to enrich the findings of a strong-scaling study on three clusters with different GPU models. We selected two modern DDL frameworks: Horovod and Tarantella. Using spatial and temporal analysis, we identify bottlenecks in the frameworks, such as a long initialization time for Horovod and the non-distribution of data during the testing phase for Tarantella. We extract performance measurements using temporal aggregation over the training phases, which can help developers of DDL frameworks improve their tools. Horovod presented the best scaling efficiency for 4 GPUs or more, with up to 84.6% scaling efficiency for 4 GPUs and a large batch size, while Tarantella achieves 54.7% for the same case. Using our temporal aggregation approach, we identified that this result originates from Horovod processing an epoch faster than Tarantella.
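
The scaling-efficiency figures quoted above can be read with the usual strong-scaling definition, in which the problem size stays fixed while the number of devices grows. The abstract does not state the exact formula the authors used, so the sketch below assumes the common form T1 / (N * TN). The function name and the timings are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def strong_scaling_efficiency(t_single: float, t_parallel: float, n_devices: int) -> float:
    """Return the strong-scaling efficiency T1 / (N * TN).

    t_single   -- execution time on one device (T1), in seconds
    t_parallel -- execution time on n_devices devices (TN), in seconds
    n_devices  -- number of devices (N), e.g. GPUs

    Assumption: this is the textbook definition, not necessarily
    the exact metric used by the authors.
    """
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_single / (n_devices * t_parallel)


# Hypothetical timings (not taken from the paper):
# 120 s on 1 GPU and 40 s on 4 GPUs for the same fixed workload.
efficiency = strong_scaling_efficiency(120.0, 40.0, 4)
print(f"{efficiency:.1%}")  # prints 75.0%
```

An efficiency of 100% would mean that four GPUs finish the fixed workload exactly four times faster than one; values below that reflect overheads such as communication, synchronization, and initialization, which are the costs the tracing methodology aims to expose.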

Acknowledgments

We thank the Tarantella, scorep-binding-python, and Score-P developers for their prompt replies, which supported our progress. This work was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, under grant no. 88887.481194/2020-00. The experiments were executed on the PCAD at the Federal University of Rio Grande do Sul, and on Grid’5000, supported by Inria, CNRS, RENATER, and other organizations.

Author information

Authors and Affiliations

Informatics Institute (PPGC/UFRGS), Porto Alegre, Brazil
Ana Luisa Veroneze Solórzano & Lucas Mello Schnorr

Corresponding authors

Correspondence to Ana Luisa Veroneze Solórzano or Lucas Mello Schnorr.

Editor information

Editors and Affiliations

University of Twente, Enschede, The Netherlands
Ana-Lucia Varbanescu
University of Maryland, College Park, MD, USA
Abhinav Bhatele
University of Tennessee, Knoxville, TN, USA
Piotr Luszczek
Université Paris-Saclay, Orsay, France
Marc Baboulin

About this paper

Cite this paper

Veroneze Solórzano, A.L., Mello Schnorr, L. (2022). Understanding Distributed Deep Learning Performance by Correlating HPC and Machine Learning Measurements. In: Varbanescu, AL., Bhatele, A., Luszczek, P., Marc, B. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_14

DOI: https://doi.org/10.1007/978-3-031-07312-0_14
Published: 29 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07311-3
Online ISBN: 978-3-031-07312-0
eBook Packages: Computer Science, Computer Science (R0)
