
Reducing global reductions in large-scale distributed training

Published: 05 August 2019
DOI: 10.1145/3339186.3339203

Abstract

Current large-scale training of deep neural networks typically employs synchronous stochastic gradient descent, which incurs a large communication overhead. Instead of optimizing the reduction routines themselves, as recent studies have done, we propose algorithms that do not require frequent global reductions. We first show that reducing the global reduction frequency acts as an effective regularizer that improves the generalization of adaptive optimizers. We then propose an algorithm that reduces the global reduction frequency by employing local reductions on a subset of learners. In addition, to maximize the effect of each reduction on convergence, we introduce reduction momentum, which further accelerates convergence.
Our experiments with the CIFAR-10 dataset show that, for the K-step averaging algorithm, extremely sparse reductions help bridge the generalization gap. With 6 GPUs, our implementation eliminates more than 99% of the global reductions performed by a regular synchronous implementation; with 32 GPUs, it halves the number of global reductions. On the ImageNet-1K dataset, we show that combining local reductions with global reductions and applying reduction momentum can reduce global reductions by up to a further 62% while achieving the same validation accuracy as K-step averaging. With 400 GPUs, the global reduction frequency drops to once per 102K samples.
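The abstract names three ingredients: running many local optimizer steps between reductions (K-step averaging), replacing some global reductions with cheaper reductions over a subset of learners, and applying a momentum term to the reduction step itself. As a rough illustration only, the PyTorch-style sketch below combines the three; the exact form of reduction momentum, the subgroup layout, and the interval parameters k_local and k_global are our assumptions, not details taken from the paper.

import torch
import torch.distributed as dist

def make_reduction_state(model):
    # One momentum buffer per parameter, accumulated across reductions.
    return [torch.zeros_like(p.data) for p in model.parameters()]

def train_step(model, optimizer, loss_fn, inputs, targets, step,
               local_group, momentum_buffers,
               k_local=16, k_global=128, red_momentum=0.9):
    """One local optimizer step, periodically followed by a local or global
    parameter averaging with momentum applied to the averaging update."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()  # purely local update; no per-step gradient all-reduce

    # Decide whether this step ends with a reduction, and over which group.
    group, group_size = None, 1
    if (step + 1) % k_global == 0:
        group, group_size = None, dist.get_world_size()            # all learners
    elif (step + 1) % k_local == 0:
        group, group_size = local_group, dist.get_world_size(local_group)

    if group_size > 1:
        for p, buf in zip(model.parameters(), momentum_buffers):
            before = p.data.clone()
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
            p.data.div_(group_size)          # plain parameter averaging
            # "Reduction momentum" (assumed form): keep a decayed running sum of
            # the parameter change produced by each reduction and reapply it.
            buf.mul_(red_momentum).add_(p.data - before)
            p.data.copy_(before + buf)
    return loss

In this sketch, local_group would be created once with dist.new_group(...) over, say, the learners sharing a node, and momentum_buffers with make_reduction_state(model). The arithmetic behind the reported reduction frequency follows directly from the numbers in the abstract: at 400 GPUs, one global reduction per 102K samples amounts to roughly 256 samples of local work per learner between global reductions.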


Cited By

  • (2021) HMA: An Efficient Training Method for NLP Models. Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence, 20-25. DOI: 10.1145/3461353.3461384. Online publication date: 5-Mar-2021.
  • (2020) Poster Abstract: Model Average-based Distributed Training for Sparse Deep Neural Networks. IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 1346-1347. DOI: 10.1109/INFOCOMWKSHPS50562.2020.9162748. Online publication date: Jul-2020.



Published In

ICPP Workshops '19: Workshop Proceedings of the 48th International Conference on Parallel Processing
August 2019
241 pages
ISBN: 9781450371964
DOI: 10.1145/3339186
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 August 2019


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2019: Workshops
August 5 - 8, 2019
Kyoto, Japan

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
