
OSP: Overlapping Computation and Communication in Parameter Server for Fast Machine Learning

Published: 05 August 2019

Abstract

When running on a Parameter Server (PS), distributed Stochastic Gradient Descent (SGD) incurs significant communication delays: after pushing their updates, computing nodes (workers) must wait for the global model to be communicated back from the master in every iteration. In this paper, we devise a new synchronization mechanism named overlap synchronization parallel (OSP), which removes this waiting time by conducting computation and communication in an overlapped manner. We theoretically prove that our mechanism achieves the same convergence rate as sequential SGD for non-convex problems. Evaluations show that it significantly outperforms state-of-the-art mechanisms, improving convergence speed by 4× for both AlexNet and ResNet18.
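
The abstract describes OSP only at a high level; the following minimal sketch (in Python, and not the paper's implementation) illustrates the worker-side control flow it implies. The parameter-server exchange is faked by a hypothetical fake_push_pull placeholder, and a toy least-squares gradient is used only so the example runs end to end; the essential point is that the blocking push/pull of one iteration executes in a background thread while the worker already computes the gradient of the next iteration, so the per-iteration wait disappears.

import threading
import queue
import numpy as np

def fake_push_pull(local_model):
    # Stand-in for the blocking parameter-server round-trip: push this worker's
    # update, pull back the aggregated global model. Here it simply echoes the
    # input so the control flow can be exercised without a real server.
    return local_model

def gradient(w, x, y):
    # Gradient of a toy least-squares loss on one mini-batch.
    return x.T @ (x @ w - y) / len(y)

def osp_worker(w, batches, lr=0.1):
    result = queue.Queue(maxsize=1)
    sync = None
    for x, y in batches:
        g = gradient(w, x, y)                     # compute on the local copy ...
        if sync is not None:
            sync.join()                           # ... while the previous push/pull finishes
            w = result.get()                      # adopt the (one-round-stale) global model
        w = w - lr * g                            # keep stepping instead of waiting
        sync = threading.Thread(
            target=lambda u: result.put(fake_push_pull(u)), args=(w.copy(),))
        sync.start()                              # overlap communication with the next compute
    if sync is not None:
        sync.join()
        w = result.get()
    return w

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
x_all = rng.normal(size=(256, 4))
y_all = x_all @ np.array([1.0, -2.0, 0.5, 3.0])
batches = [(x_all[i:i + 32], y_all[i:i + 32]) for i in range(0, 256, 32)]
w_est = osp_worker(np.zeros(4), batches)

In a real deployment the background thread would be replaced by the framework's asynchronous push/pull primitives; the model a worker adopts is then one synchronization round stale, which is presumably the kind of delay the paper's convergence analysis has to account for.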


Published In

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing
August 2019
1107 pages
ISBN: 9781450362955
DOI: 10.1145/3337821

In-Cooperation

  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Distributed
  2. Machine learning
  3. Parameter Server
  4. SGD
  5. Synchronization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2019

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

