Abstract
Artificial intelligence is a transformative technology for creating new scientific discoveries, services, and products. Its full potential is realized when massive data repositories and large-scale computing systems are available. Both factors are becoming easier to obtain every day: sensor networks constantly feed open-data archives, and Moore's law continues to make supercomputing power more accessible. However, as deep learning models grow larger to tackle data complexity, researchers must determine how to speed up their training. This paper takes an experimental approach to understanding the algorithms and trade-offs of distributed deep learning. Using the Summit supercomputer at Oak Ridge National Laboratory, we find that existing distributed deep learning mechanisms scale well in execution time, but accuracy degrades significantly as more nodes are used. Recovering that accuracy requires tuning several hyper-parameters, and our results show that this tuning is a nontrivial task. We also evaluate the impact of complementary scaling techniques, such as mixed precision and adaptive parameter optimization.
Notice: This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan).
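To make the mechanisms under study concrete, the sketch below shows the general shape of data-parallel training in PyTorch: DistributedDataParallel (DDP) replicates the model across processes and averages gradients after each backward pass, torch.cuda.amp provides mixed-precision arithmetic, and the learning rate is scaled linearly with the global batch size. This is a minimal illustration, not the experiment code used in the paper; the ResNet-50 model, the FakeData placeholder dataset, and all hyper-parameter values are assumptions chosen for brevity.

```python
# Minimal sketch of distributed data-parallel training with mixed precision.
# Assumes one process per GPU, launched e.g. with torchrun; model, dataset,
# and hyper-parameter values are illustrative placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import torchvision

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torchvision.models.resnet50().cuda()
    model = DDP(model, device_ids=[local_rank])    # gradients all-reduced automatically

    # Linear learning-rate scaling: grow the base rate with the global batch
    # size (per-GPU batch times world size), one of the hyper-parameters the
    # study identifies as needing retuning at scale.
    base_lr, per_gpu_batch = 0.1, 32
    lr = base_lr * dist.get_world_size() * per_gpu_batch / 256
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    dataset = torchvision.datasets.FakeData(transform=torchvision.transforms.ToTensor())
    sampler = DistributedSampler(dataset)          # shards the data across ranks
    loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)

    scaler = torch.cuda.amp.GradScaler()           # loss scaling for mixed precision
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards each epoch
        for images, targets in loader:
            images, targets = images.cuda(), targets.cuda()
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():        # FP16 forward/backward where safe
                loss = loss_fn(model(images), targets)
            scaler.scale(loss).backward()          # DDP averages gradients here
            scaler.step(optimizer)
            scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with one process per GPU (for example, torchrun --nproc_per_node=6 train.py on a six-GPU Summit node), the global batch size grows with the number of ranks, which is precisely why the learning rate and related hyper-parameters must be retuned at scale; adaptive schemes such as AdaScale automate part of that adjustment.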
Copyright information
© 2022 UT-Battelle, LLC
Cite this paper
Rojas, E., Quirós-Corella, F., Jones, T., Meneses, E. (2022). Large-Scale Distributed Deep Learning: A Study of Mechanisms and Trade-Offs with PyTorch. In: Gitler, I., Barrios Hernández, C.J., Meneses, E. (eds) High Performance Computing. CARLA 2021. Communications in Computer and Information Science, vol 1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_13