
DRPS: efficient disk-resident parameter servers for distributed machine learning

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

The parameter server (PS), the state-of-the-art distributed framework for large-scale iterative machine learning tasks, has been extensively studied. However, existing PS-based systems often depend on in-memory implementations. Under such memory constraints, machine learning (ML) developers cannot train large-scale ML models on their relatively small local clusters, and renting large-scale cloud servers is often economically infeasible for research teams and small companies. In this paper, we propose a disk-resident parameter server system named DRPS, which reduces the hardware requirements of large-scale machine learning tasks by storing high-dimensional models on disk. To further improve the performance of DRPS, we build an efficient index structure for parameters to reduce the disk I/O cost. Based on this index structure, we propose a novel multi-objective partitioning algorithm for the parameters. Finally, a flexible worker-selection parallel computation model (WSP) is proposed to strike the right balance between inconsistent parameter versions (staleness) and inconsistent execution progress (stragglers). Extensive experiments on many typical machine learning applications with real and synthetic datasets validate the effectiveness of DRPS.
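To make the disk-resident idea concrete, the sketch below shows a minimal, single-machine parameter store whose values live in a file on disk and whose small in-memory index maps each parameter key to a file offset, so a pull or push touches only the bytes it needs. This is an illustrative sketch only; the class and method names (DiskParameterStore, pull, push) are assumptions for exposition and do not reflect DRPS's actual API, index structure, partitioning, or WSP scheduling.

```python
# Illustrative sketch of a disk-resident parameter store (not DRPS itself).
# Parameters are stored as fixed-width float64 values in a file; an
# in-memory index maps key -> byte offset, so pull/push read or write
# only the touched parameters instead of loading the whole model.
import os
import struct

VALUE_FMT = "d"                          # one float64 per parameter
VALUE_SIZE = struct.calcsize(VALUE_FMT)  # 8 bytes

class DiskParameterStore:
    def __init__(self, path, num_params):
        self.path = path
        self.index = {}                  # key -> byte offset in the file
        # Pre-allocate the parameter file (all zeros) and build the index.
        with open(path, "wb") as f:
            for key in range(num_params):
                self.index[key] = f.tell()
                f.write(struct.pack(VALUE_FMT, 0.0))
        self.file = open(path, "r+b")

    def pull(self, keys):
        """Read only the requested parameters from disk."""
        values = []
        for key in keys:
            self.file.seek(self.index[key])
            (v,) = struct.unpack(VALUE_FMT, self.file.read(VALUE_SIZE))
            values.append(v)
        return values

    def push(self, keys, grads, lr=0.1):
        """Apply sparse gradient updates in place, one seek per touched key."""
        for key, g in zip(keys, grads):
            self.file.seek(self.index[key])
            (v,) = struct.unpack(VALUE_FMT, self.file.read(VALUE_SIZE))
            self.file.seek(self.index[key])
            self.file.write(struct.pack(VALUE_FMT, v - lr * g))

    def close(self):
        self.file.close()

# Usage: a worker pulls the slice of the model it needs, computes
# gradients locally, and pushes sparse updates back.
store = DiskParameterStore("params.bin", num_params=10_000)
print(store.pull([3, 42]))               # -> [0.0, 0.0]
store.push([3, 42], grads=[0.5, -1.0])
print(store.pull([3, 42]))               # -> [-0.05, 0.1]
store.close()
os.remove("params.bin")
```

In a full PS deployment this store would be sharded across server nodes and accessed by workers over the network; how the keys are partitioned across servers and how worker progress is synchronized are exactly the concerns the paper's partitioning algorithm and WSP model address.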


Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFB1003404), the National Natural Science Foundation of China (Grant Nos. 62072083, U1811261, 61902366), the Basal Research Fund (N180716010), the Liaoning Revitalization Talents Program (XLYC1807158), and the China Postdoctoral Science Foundation (2020T130623).

Author information


Corresponding author

Correspondence to Yu Gu.

Additional information

Zhen Song received the master's degree in computer software and theory from Northeastern University, China in 2019. He is a PhD candidate at Northeastern University, China. His current research interests include distributed graph computation and distributed machine learning.

Yu Gu received the PhD degree in computer software and theory from Northeastern University, China in 2010. He is currently a professor and PhD supervisor at Northeastern University, China. His current research interests include big data analysis, spatial data management, and graph data management. He is a senior member of the China Computer Federation (CCF).

Zhigang Wang received the PhD degree in computer software and theory from Northeastern University, China in 2018. He is currently a lecturer in the College of Information Science and Engineering, Ocean University of China, China. He was a visiting PhD student at the University of Massachusetts Amherst, USA from December 2014 to December 2016. His research interests include cloud computing, distributed graph processing, and machine learning.

Ge Yu received the PhD degree in computer science from Kyushu University, Japan in 1996. He is currently a professor and PhD supervisor at Northeastern University, China. His research interests include distributed and parallel databases, OLAP and data warehousing, data integration, and graph data management. He is a member of IEEE, the IEEE Computer Society, and ACM, and a Fellow of the China Computer Federation (CCF).



About this article


Cite this article

Song, Z., Gu, Y., Wang, Z. et al. DRPS: efficient disk-resident parameter servers for distributed machine learning. Front. Comput. Sci. 16, 164321 (2022). https://doi.org/10.1007/s11704-021-0445-2

