
DOSP: an optimal synchronization of parameter server for distributed machine learning

The Journal of Supercomputing

Abstract

In distributed machine learning (DML), the straggler problem caused by heterogeneous environments and external factors leads to high synchronization overhead and slows ML training. To alleviate the straggler problem, we propose a new dynamic optimal synchronous parallel (DOSP) strategy that performs partial synchronization based on dynamic clustering of iteration completion times. First, we present a model to calculate the completion time of DML parameter training. Then, we define the optimal synchronization point of the partial synchronization scheme and design a synchronization scheme based on clustering of iteration completion times. Finally, inspired by the delay phenomenon in which adjacent synchronization points are separated by only a narrow slot during synchronization, we define a gradient aggregation time slot to guide synchronization evaluation and obtain the optimal synchronization point. The whole idea has been implemented in a prototype called STAR (our implementation is available at https://github.com/oumiga1314/opt_experient). Experimental results on STAR show that DOSP improves training accuracy by 1–3% and training speed by 1.24–2.93x compared with other existing schemes.
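
As a rough illustration of the clustering idea behind DOSP (this is not the authors' STAR implementation, which is available at the GitHub link above), the Python sketch below clusters per-worker iteration completion times and waits only for the workers outside the slowest cluster. The function and variable names (choose_sync_point, completion_times) and the use of two-cluster k-means are assumptions made for this sketch; the paper's actual clustering and optimal synchronization point selection may differ.

    import numpy as np
    from sklearn.cluster import KMeans

    def choose_sync_point(completion_times, n_clusters=2):
        """Cluster per-worker iteration completion times and pick a partial
        synchronization point: wait only for workers outside the slowest cluster.
        (Illustrative sketch, not the paper's algorithm.)"""
        times = np.asarray(completion_times, dtype=float).reshape(-1, 1)
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(times)

        # Identify the slowest cluster by its mean completion time.
        cluster_means = [times[labels == c].mean() for c in range(n_clusters)]
        slowest = int(np.argmax(cluster_means))

        # Workers outside the slowest cluster take part in this aggregation round;
        # the synchronization point is the latest completion time among them.
        fast_workers = [i for i, c in enumerate(labels) if c != slowest]
        sync_point = max(completion_times[i] for i in fast_workers)
        return sync_point, fast_workers

    # Toy example: worker 3 (6.5 s) is a straggler and is excluded this round.
    times = [1.9, 2.1, 2.0, 6.5, 2.2]
    point, workers = choose_sync_point(times)
    print(point, workers)  # e.g. 2.2 [0, 1, 2, 4]

In this toy example, the straggler falls into the slowest cluster and is excluded from the current aggregation round, so the synchronization point is the latest completion time among the remaining workers.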



Acknowledgements

This work is supported by the Youth Science Foundation of the Natural Science Foundation of Hunan Province (No. 2020JJ5775) and the National Natural Science Foundation of China (Nos. 62172442 and 62172451).

Author information


Corresponding author

Correspondence to Meiguang Zheng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zheng, M., Mao, D., Yang, L. et al. DOSP: an optimal synchronization of parameter server for distributed machine learning. J Supercomput 78, 13865–13892 (2022). https://doi.org/10.1007/s11227-022-04422-6

