Abstract
In distributed machine learning (DML), the straggler problem caused by heterogeneous environments and external factors leads to high synchronization overhead and slows the training process. To alleviate the straggler problem, we propose a new dynamic optimal synchronous parallel (DOSP) strategy that performs partial synchronization based on dynamic clustering of iteration completion times. First, we present a model to calculate the completion time of DML parameter training. Then, we define the optimal synchronization point of the partial synchronization scheme and design a synchronization scheme based on clustering of iteration completion times. Finally, inspired by the delay phenomenon that arises when the slot between adjacent synchronization points is narrow, we define a gradient aggregation time slot to guide the synchronization evaluation and obtain the optimal synchronization point. The whole scheme has been implemented in a prototype called STAR (our implementation is available at https://github.com/oumiga1314/opt_experient). Experiments carried out on STAR show that DOSP improves training accuracy by 1–3% and training speed by 1.24–2.93× compared with existing schemes.
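The core idea behind DOSP is to group workers by their iteration completion times and to synchronize partially within the leading group instead of waiting for every worker. The sketch below is only an illustration of that idea, not the STAR implementation: the greedy one-dimensional clustering, the `gap_ratio` threshold, and the function names are our own assumptions.

```python
# Hypothetical sketch of partial synchronization via completion-time clustering
# (illustrative only; not the authors' STAR code).
from typing import Dict, List


def cluster_by_completion_time(times: Dict[str, float],
                               gap_ratio: float = 0.2) -> List[List[str]]:
    """Greedy 1-D clustering: start a new cluster whenever the gap to the
    previous worker's completion time exceeds gap_ratio of that time."""
    ordered = sorted(times.items(), key=lambda kv: kv[1])
    clusters: List[List[str]] = [[ordered[0][0]]]
    prev = ordered[0][1]
    for worker, t in ordered[1:]:
        if t - prev > gap_ratio * prev:   # large gap -> likely straggler group
            clusters.append([worker])
        else:
            clusters[-1].append(worker)
        prev = t
    return clusters


def pick_sync_group(times: Dict[str, float]) -> List[str]:
    """Return the earliest-finishing cluster as the partial-synchronization set;
    gradients from later clusters would be merged at a later aggregation slot."""
    return cluster_by_completion_time(times)[0]


if __name__ == "__main__":
    completion = {"w0": 1.02, "w1": 1.05, "w2": 1.08, "w3": 1.9, "w4": 2.0}
    print(pick_sync_group(completion))    # ['w0', 'w1', 'w2']
```

In this toy run, the two slow workers form their own cluster and are excluded from the current synchronization round, which is the effect the partial synchronization scheme aims for.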
Acknowledgements
This work is supported by the Youth Science Foundation of the Natural Science Foundation of Hunan Province (No. 2020JJ5775) and the National Natural Science Foundation of China (Nos. 62172442 and 62172451).
Cite this article
Zheng, M., Mao, D., Yang, L. et al. DOSP: an optimal synchronization of parameter server for distributed machine learning. J Supercomput 78, 13865–13892 (2022). https://doi.org/10.1007/s11227-022-04422-6