Abstract
Frameworks for Distributed Deep Learning (DDL) have become popular alternatives for distributing training by adding a few lines of code to a single-node script. From a High-Performance Computing (HPC) perspective, the traditional profiling tools used by Machine Learning (ML) researchers fail to expose details about distributed training performance, such as synchronization points, communication and computing time, and device usage throughout the training. Moreover, these results are usually considered independently. We present a methodology for performance analysis of DDL frameworks that combines HPC and ML tools, applying intrusive and non-intrusive tracing to enrich the findings for a strong scaling study on three clusters with different GPU models. We selected two modern DDL frameworks: Horovod and Tarantella. Using spatial and temporal analysis, we identify bottlenecks in the frameworks, such as a long initialization time for Horovod and the non-distribution of data during the testing phase for Tarantella. We extract performance measurements using temporal aggregation over the training phases, which can help DDL framework developers improve their tools. Horovod presented the best scaling efficiency for 4 GPUs or more, with up to 84.6% scaling efficiency for 4 GPUs and a large batch size, while Tarantella achieved 54.7% for the same case. Using our temporal aggregation approach, we identified that this result originates from Horovod processing an epoch faster than Tarantella.
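The scaling efficiency figures quoted above can be read against the usual strong-scaling definition, in which the problem size stays fixed and efficiency is the speedup divided by the number of devices. The sketch below assumes that definition; the abstract does not state the exact formula the authors used, and the function name and timings are hypothetical illustrations, not measurements from the paper.

```python
def strong_scaling_efficiency(t_single: float, t_parallel: float, n_devices: int) -> float:
    """Return strong-scaling efficiency: (t_single / t_parallel) / n_devices.

    Assumes the common textbook definition; the paper's exact formula
    is not given in the abstract.
    """
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    speedup = t_single / t_parallel
    return speedup / n_devices


# Hypothetical training times in seconds, keyed by GPU count.
# These are made-up values for illustration, NOT results from the paper.
hypothetical_times = {1: 100.0, 2: 55.0, 4: 30.0}

baseline = hypothetical_times[1]
efficiencies = {
    n: strong_scaling_efficiency(baseline, t, n)
    for n, t in hypothetical_times.items()
}

for n, eff in efficiencies.items():
    print(f"{n} GPU(s): efficiency = {eff:.1%}")
```

Under this definition, an efficiency of 100% means that doubling the devices halves the training time; values below that reflect overheads such as communication, synchronization, or initialization.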
Acknowledgments
We are thankful to the Tarantella, scorep-binding-python, and Score-P developers for their prompt replies, which supported our progress. This work was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, under grant no. 88887.481194/2020-00. The experiments were executed on the PCAD at the Federal University of Rio Grande do Sul, and on the Grid’5000, supported by Inria, CNRS, RENATER and other organizations.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Veroneze Solórzano, A.L., Mello Schnorr, L. (2022). Understanding Distributed Deep Learning Performance by Correlating HPC and Machine Learning Measurements. In: Varbanescu, AL., Bhatele, A., Luszczek, P., Marc, B. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_14
DOI: https://doi.org/10.1007/978-3-031-07312-0_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07311-3
Online ISBN: 978-3-031-07312-0
eBook Packages: Computer Science, Computer Science (R0)