Abstract
In distributed machine learning (DML), the straggler problem caused by heterogeneous environments and external factors leads to high synchronization overhead and slows the training process. To alleviate the straggler problem, we propose a new dynamic optimal synchronous parallel (DOSP) strategy that performs partial synchronization based on dynamic clustering of iteration completion times. First, we present a model to calculate the completion time of DML parameter training. Then, we define the optimal synchronization point of the partial synchronization scheme and design a synchronization scheme based on clustering of iteration completion times. Finally, inspired by the delay phenomenon that arises when the slot between adjacent synchronization points is narrow, we define a gradient aggregation time slot to guide the synchronization evaluation and obtain the optimal synchronization point. The whole scheme has been implemented in a prototype called STAR (our implementation is available at https://github.com/oumiga1314/opt_experient). Experiments carried out on STAR show that DOSP improves training accuracy by 1–3% and training speed by 1.24–2.93× compared with existing schemes.
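The core idea behind DOSP is to group workers by their iteration completion times and to synchronize partially within the leading group instead of waiting for every worker. The sketch below is only an illustration of that idea, not the STAR implementation: the greedy one-dimensional clustering, the `gap_ratio` threshold, and the function names are our own assumptions.

```python
# Hypothetical sketch of partial synchronization via completion-time clustering
# (illustrative only; not the authors' STAR code).
from typing import Dict, List


def cluster_by_completion_time(times: Dict[str, float],
                               gap_ratio: float = 0.2) -> List[List[str]]:
    """Greedy 1-D clustering: start a new cluster whenever the gap to the
    previous worker's completion time exceeds gap_ratio of that time."""
    ordered = sorted(times.items(), key=lambda kv: kv[1])
    clusters: List[List[str]] = [[ordered[0][0]]]
    prev = ordered[0][1]
    for worker, t in ordered[1:]:
        if t - prev > gap_ratio * prev:   # large gap -> likely straggler group
            clusters.append([worker])
        else:
            clusters[-1].append(worker)
        prev = t
    return clusters


def pick_sync_group(times: Dict[str, float]) -> List[str]:
    """Return the earliest-finishing cluster as the partial-synchronization set;
    gradients from later clusters would be merged at a later aggregation slot."""
    return cluster_by_completion_time(times)[0]


if __name__ == "__main__":
    completion = {"w0": 1.02, "w1": 1.05, "w2": 1.08, "w3": 1.9, "w4": 2.0}
    print(pick_sync_group(completion))    # ['w0', 'w1', 'w2']
```

In this toy run, the two slow workers form their own cluster and are excluded from the current synchronization round, which is the effect the partial synchronization scheme aims for.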
Acknowledgements
This work is supported by the Youth Science Foundation of the Natural Science Foundation of Hunan Province (No. 2020JJ5775) and the National Natural Science Foundation of China (Nos. 62172442 and 62172451).
Cite this article
Zheng, M., Mao, D., Yang, L. et al. DOSP: an optimal synchronization of parameter server for distributed machine learning. J Supercomput 78, 13865–13892 (2022). https://doi.org/10.1007/s11227-022-04422-6