Distributed machine learning load balancing strategy in cloud computing services

Li, Mingwei; Zhang, Jilin; Wan, Jian; Ren, Yongjian; Zhou, Li; Wu, Baofu; Yang, Rui; Wang, Jue

doi:10.1007/s11276-019-02042-2

Published: 06 July 2019

Wireless Networks, Volume 26, pages 5517–5533 (2020)

Mingwei Li^1,2, Jilin Zhang^1,2,3, Jian Wan^1,2,4, Yongjian Ren^1,2, Li Zhou^1,2, Baofu Wu^1,2, Rui Yang^1,2 & Jue Wang^5

Abstract

Mobile service computing is a new cloud computing model that provides various cloud services to users of mobile intelligent terminals through mobile internet access. Quality of service is an essential problem in mobile service computing. In this paper, we study how to accelerate the training of a distributed machine learning (ML) model based on cloud services. Distributed ML has become the mainstream way to train today's ML models. In traditional distributed ML based on bulk synchronous parallel (BSP), a temporary slowdown of any node in the cluster delays the computation of the other nodes because synchronization barriers occur frequently, which degrades overall performance. We propose a load balancing strategy named adaptive fast reassignment (AdaptFR) and, based on it, build a distributed parallel computing model called adaptive-dynamic synchronous parallel (A-DSP). A-DSP uses a more relaxed synchronization model to reduce the overhead of synchronization while preserving the consistency of the model. A-DSP also implements the AdaptFR load balancing strategy, which addresses the straggler problem caused by performance differences between nodes without sacrificing model accuracy. Experiments show that A-DSP effectively improves training speed while maintaining model accuracy in distributed ML training.
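The straggler effect described in the abstract can be illustrated with a small sketch. It is not the AdaptFR algorithm, whose details are not given on this page. It only assumes that one bulk synchronous parallel (BSP) iteration ends when the slowest worker reaches the barrier, and it uses a hypothetical speed-proportional reassignment as a stand-in for a load balancing step. The function names, worker speeds, and workload sizes are all illustrative assumptions.

```python
def bsp_iteration_time(work, speed):
    """Duration of one BSP iteration.

    Every worker waits at the barrier, so the iteration lasts as long
    as the slowest worker needs for its share of the work.
    """
    return max(w / s for w, s in zip(work, speed))


def rebalance_by_speed(total_work, speed):
    """Hypothetical reassignment (not AdaptFR itself): split the work
    in proportion to each worker's measured speed."""
    total_speed = sum(speed)
    return [total_work * s / total_speed for s in speed]


# Assumed speeds in samples per second; the last worker is a straggler.
speed = [100.0, 100.0, 100.0, 25.0]
# Equal split of 1200 samples across four workers.
equal = [300.0] * 4

balanced = rebalance_by_speed(sum(equal), speed)
t_equal = bsp_iteration_time(equal, speed)
t_balanced = bsp_iteration_time(balanced, speed)

print(f"equal split:    {t_equal:.2f} s per iteration")
print(f"balanced split: {t_balanced:.2f} s per iteration")
```

Under these assumed numbers the equal split is held back by the slow worker, while the proportional split lets all workers reach the barrier at about the same time. A real system would also have to measure speeds online and account for the cost of moving data between nodes.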

Acknowledgements

This work is partly supported by the National Key Technology Research and Development Program under Grant No. 2018YFB0204001; the National Natural Science Foundation of China under Grant Nos. 61672200 and 61572163; the Key Technology Research and Development Program of the Zhejiang Province under Grant Nos. 2019C01059, 2019C03135 and 2019C03134; the Zhejiang Natural Science Funds under Grant No. LY17F020029; the State Key Laboratory of Computer Architecture Project No. CARCH201712; and the Hangzhou Dianzi University Postgraduate Research Innovation Fund Program under Grant No. CXJJ2018052.

Author information

Authors and Affiliations

School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China
Mingwei Li, Jilin Zhang, Jian Wan, Yongjian Ren, Li Zhou, Baofu Wu & Rui Yang
Key Laboratory of Complex Systems Modeling and Simulation, Ministry of Education, Hangzhou, 310018, China
Mingwei Li, Jilin Zhang, Jian Wan, Yongjian Ren, Li Zhou, Baofu Wu & Rui Yang
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Jilin Zhang
School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, Zhejiang, China
Jian Wan
Supercomputing Center of Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100190, China
Jue Wang

Corresponding author

Correspondence to Jian Wan.

About this article

Cite this article

Li, M., Zhang, J., Wan, J. et al. Distributed machine learning load balancing strategy in cloud computing services. Wireless Netw 26, 5517–5533 (2020). https://doi.org/10.1007/s11276-019-02042-2

Issue Date: November 2020
DOI: https://doi.org/10.1007/s11276-019-02042-2
