
Reducing global reductions in large-scale distributed training

Published: 05 August 2019
DOI: 10.1145/3339186.3339203

Abstract

Current large-scale training of deep neural networks typically employs synchronous stochastic gradient descent, which incurs a large communication overhead. Instead of optimizing the reduction routines themselves, as recent studies have done, we propose algorithms that do not require frequent global reductions. We first show that reducing the global reduction frequency acts as an effective regularizer that improves the generalization of adaptive optimizers. We then propose an algorithm that reduces the global reduction frequency by employing local reductions on a subset of learners. In addition, to maximize the effect of each reduction on convergence, we introduce reduction momentum, which further accelerates convergence.
Our experiments with the CIFAR-10 dataset show that, for the K-step averaging algorithm, extremely sparse reductions help bridge the generalization gap. With 6 GPUs, our implementation eliminates more than 99% of the global reductions performed by a regular synchronous implementation; with 32 GPUs, it halves the number of global reductions. On the ImageNet-1K dataset, we show that combining local reductions with global reductions and applying reduction momentum can reduce global reductions by up to a further 62% while achieving the same validation accuracy as K-step averaging. With 400 GPUs, the global reduction frequency drops to once per 102K samples.
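The abstract names three ingredients: running many local optimizer steps between reductions (K-step averaging), replacing some global reductions with cheaper reductions over a subset of learners, and applying a momentum term to the reduction step itself. As a rough illustration only, the PyTorch-style sketch below combines the three; the exact form of reduction momentum, the subgroup layout, and the interval parameters k_local and k_global are our assumptions, not details taken from the paper.

import torch
import torch.distributed as dist

def make_reduction_state(model):
    # One momentum buffer per parameter, accumulated across reductions.
    return [torch.zeros_like(p.data) for p in model.parameters()]

def train_step(model, optimizer, loss_fn, inputs, targets, step,
               local_group, momentum_buffers,
               k_local=16, k_global=128, red_momentum=0.9):
    """One local optimizer step, periodically followed by a local or global
    parameter averaging with momentum applied to the averaging update."""
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()  # purely local update; no per-step gradient all-reduce

    # Decide whether this step ends with a reduction, and over which group.
    group, group_size = None, 1
    if (step + 1) % k_global == 0:
        group, group_size = None, dist.get_world_size()            # all learners
    elif (step + 1) % k_local == 0:
        group, group_size = local_group, dist.get_world_size(local_group)

    if group_size > 1:
        for p, buf in zip(model.parameters(), momentum_buffers):
            before = p.data.clone()
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
            p.data.div_(group_size)          # plain parameter averaging
            # "Reduction momentum" (assumed form): keep a decayed running sum of
            # the parameter change produced by each reduction and reapply it.
            buf.mul_(red_momentum).add_(p.data - before)
            p.data.copy_(before + buf)
    return loss

In this sketch, local_group would be created once with dist.new_group(...) over, say, the learners sharing a node, and momentum_buffers with make_reduction_state(model). The arithmetic behind the reported reduction frequency follows directly from the numbers in the abstract: at 400 GPUs, one global reduction per 102K samples amounts to roughly 256 samples of local work per learner between global reductions.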


Cited By

  • (2021) HMA: An Efficient Training Method for NLP Models. Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence, 20-25. DOI: 10.1145/3461353.3461384. Online publication date: 5-Mar-2021.
  • (2020) Poster Abstract: Model Average-based Distributed Training for Sparse Deep Neural Networks. IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 1346-1347. DOI: 10.1109/INFOCOMWKSHPS50562.2020.9162748. Online publication date: Jul-2020.



Published In

ICPP Workshops '19: Workshop Proceedings of the 48th International Conference on Parallel Processing
August 2019
241 pages
ISBN: 9781450371964
DOI: 10.1145/3339186
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 August 2019


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2019: Workshops
August 5 - 8, 2019
Kyoto, Japan

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
