Research Article
DOI: 10.1145/3126686.3126749

Efficient Communications in Training Large Scale Neural Networks

Published: 23 October 2017

Abstract

We consider the problem of reducing the cost of the communication required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many collective communication operations, such as broadcasts of parameters or reductions for partial gradient aggregation, which for large messages quickly dominate overall execution time and limit parallel scalability. To address this problem, we develop a new technique for collective operations, referred to as Linear Pipelining (LP). It is tuned to the message sizes that arise in BSP-SGD and works effectively on multi-GPU systems. Theoretically, the cost of LP is invariant to P, the number of GPUs, while the cost of the more conventional Minimum Spanning Tree (MST) approach scales as O(log P). LP also delivers up to 2x higher bandwidth than the Bidirectional Exchange (BE) techniques widely adopted by current MPI implementations. We apply these collectives to BSP-SGD and show that the proposed implementations reduce communication bottlenecks in practice while preserving the attractive convergence properties of BSP-SGD.
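The scaling claim can be made concrete with the standard alpha-beta cost model for collectives: a linear-pipeline broadcast splits an n-byte message into k chunks and streams them down the GPU chain, taking roughly k + P - 2 chunk-sized steps, so for large messages its cost is nearly independent of P, whereas an MST broadcast pays ceil(log2 P) full-message steps. The sketch below is an illustrative model only, not the authors' implementation; the latency (alpha), bandwidth (beta), and chunk-size values are assumed for demonstration.

    # Hypothetical sketch: compare a linear-pipeline (LP) broadcast with a
    # minimum-spanning-tree (MST) broadcast under a simple alpha-beta cost model.
    # All constants are assumed for illustration, not taken from the paper.
    import math

    def lp_broadcast_steps(num_chunks: int, num_gpus: int) -> int:
        # Chunks flow down the chain GPU0 -> GPU1 -> ...; once the pipeline is
        # full, one chunk reaches the last GPU per step.
        return num_chunks + num_gpus - 2

    def mst_broadcast_steps(num_gpus: int) -> int:
        # Each round doubles the number of GPUs holding the full message.
        return math.ceil(math.log2(num_gpus))

    def cost(steps: int, bytes_per_step: int,
             alpha: float = 5e-6, beta: float = 1e-9) -> float:
        # Alpha-beta model: per-message latency alpha plus beta seconds per byte.
        return steps * (alpha + beta * bytes_per_step)

    if __name__ == "__main__":
        msg_bytes = 256 * 1024 * 1024      # e.g. the parameters of one large layer
        chunk_bytes = 4 * 1024 * 1024
        chunks = msg_bytes // chunk_bytes
        for p in (2, 4, 8, 16):
            lp = cost(lp_broadcast_steps(chunks, p), chunk_bytes)
            mst = cost(mst_broadcast_steps(p), msg_bytes)
            print(f"P={p:2d}  LP ~{lp * 1e3:7.1f} ms   MST ~{mst * 1e3:7.1f} ms")

With these assumed constants, the LP time grows only slightly with P (the pipeline fill adds P - 2 chunk-sized steps), while the MST time grows as log P times the full message cost, which mirrors the scaling comparison made in the abstract.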


Published In

Thematic Workshops '17: Proceedings of the Thematic Workshops of ACM Multimedia 2017
October 2017
558 pages
ISBN: 978-1-4503-5416-5
DOI: 10.1145/3126686
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 October 2017

Author Tags

  1. deep learning system
  2. MPI collectives
  3. neural networks

Qualifiers

  • Research-article

Conference

MM '17: ACM Multimedia Conference
October 23-27, 2017
Mountain View, California, USA

Cited By

  • (2024) Modeling and Simulation of Collective Algorithms on HPC Network Topologies using Structural Simulation Toolkit. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 909-916. DOI: 10.1109/SCW63240.2024.00129
  • (2024) SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications. Journal of Parallel and Distributed Computing, 183, 104767. DOI: 10.1016/j.jpdc.2023.104767
  • (2022) CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning. IEEE/ACM Transactions on Networking, 30(1), 148-161. DOI: 10.1109/TNET.2021.3109097
  • (2021) Accelerating distributed deep neural network training with pipelined MPI allreduce. Cluster Computing. DOI: 10.1007/s10586-021-03370-9
  • (2020) FFT-based Gradient Sparsification for the Distributed Training of Deep Neural Networks. Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 113-124. DOI: 10.1145/3369583.3392681
  • (2020) Structured pruning of recurrent neural networks through neuron selection. Neural Networks, 123, 134-141. DOI: 10.1016/j.neunet.2019.11.018
