
A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning

  • Conference paper
  • First Online:
Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2023)

Abstract

With the development of machine learning in fields such as medical care and smart manufacturing, data volumes have grown explosively. Training a deep learning model on large-scale data with the limited resources of a single device is a challenge. Distributed machine learning, in which a parameter server and multiple clients train a model collaboratively, is an effective way to address this problem, but it requires substantial communication between devices whose communication resources are limited. The stale synchronous parallel method is a mainstream approach to reducing this communication cost; however, because its delay threshold is set by the user based on experience, an inappropriate value often leads to high synchronization delay and low computing efficiency. This paper proposes a synchronous parallel method with parameter communication prediction for distributed machine learning. It predicts the optimal timing for synchronization, which avoids the long synchronization waiting time caused by inappropriate threshold settings in the stale synchronous parallel method. Moreover, it allows fast nodes to continue local training while global synchronization is in progress, which improves the resource utilization of worker nodes. Experimental results show that, compared with the stale synchronous parallel method, our method significantly improves training time, model quality, and resource usage.
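
The abstract contrasts two synchronization rules: the stale synchronous parallel (SSP) rule, in which a fast worker blocks once it runs more than a user-chosen staleness threshold ahead of the slowest worker, and the proposed prediction-based rule, which estimates when the next global synchronization should occur so that fast workers keep training instead of idling. The sketch below is not the authors' implementation; it is a minimal illustration of those two decision rules, and all names (Worker, ssp_must_wait, predicted_sync_iteration, horizon_s) are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) contrasting the SSP
# staleness check with a prediction-based synchronization schedule.
from statistics import mean


class Worker:
    def __init__(self, worker_id, iter_times):
        self.worker_id = worker_id
        self.iteration = 0            # local clock: completed iterations
        self.iter_times = iter_times  # observed per-iteration durations (s)

    def avg_iter_time(self):
        return mean(self.iter_times)


def ssp_must_wait(worker, workers, staleness):
    """Classic SSP rule: a worker blocks if it is more than `staleness`
    iterations ahead of the slowest worker."""
    slowest = min(w.iteration for w in workers)
    return worker.iteration - slowest > staleness


def predicted_sync_iteration(workers, horizon_s):
    """Prediction-based rule (illustrative): estimate how far each worker
    can advance within a planning horizon from its average iteration time,
    and schedule the next global synchronization at the iteration the
    slowest worker is expected to reach."""
    return min(w.iteration + int(horizon_s / w.avg_iter_time())
               for w in workers)


if __name__ == "__main__":
    workers = [Worker(0, [0.10, 0.11, 0.10]),
               Worker(1, [0.25, 0.27, 0.26]),   # straggler
               Worker(2, [0.12, 0.12, 0.13])]
    workers[0].iteration, workers[1].iteration, workers[2].iteration = 9, 4, 8

    print("SSP (threshold=3): worker 0 must wait?",
          ssp_must_wait(workers[0], workers, staleness=3))
    print("Predicted next sync iteration (1 s horizon):",
          predicted_sync_iteration(workers, horizon_s=1.0))
```

Under the SSP rule the fast worker idles as soon as the threshold is exceeded, whereas the predicted schedule lets it continue local updates until the estimated synchronization point, which is the behavior the paper attributes to its method.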

Acknowledgment

I would like to express my gratitude to all those who helped me during the writing of this work. This work was supported by the Key Technology Research and Development Program of China under Grant No. 2022YFB2901200.

Author information

Corresponding author

Correspondence to Meiting Xue.

Copyright information

© 2024 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Zeng, Y. et al. (2024). A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning. In: Gao, H., Wang, X., Voros, N. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2023. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 563. Springer, Cham. https://doi.org/10.1007/978-3-031-54531-3_21

  • DOI: https://doi.org/10.1007/978-3-031-54531-3_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-54530-6

  • Online ISBN: 978-3-031-54531-3

  • eBook Packages: Computer Science, Computer Science (R0)
