Accelerating Distributed Machine Learning by Smart Parameter Server

Published: 17 August 2019

Abstract

The Parameter Server (PS) architecture is widely used in distributed machine learning (DML), but how to improve DML performance within this framework remains an open issue. Existing work mainly approaches the problem from the workers' side. In this paper, we tackle it from another perspective, by leveraging the central control available at the PS. Specifically, we propose SmartPS, which transforms the PS from its passive role in traditional DML and fully exploits its intelligence. First, the PS holds a global view of parameter dependencies, enabling it to update workers' parameters selectively and proactively. Second, the PS records workers' speeds and prioritizes parameter transmission to narrow the gap between stragglers and fast workers. Third, the PS considers parameter dependencies across consecutive training iterations and opportunistically blocks unnecessary pushes from workers. We conduct comparative experiments with two typical benchmarks, Matrix Factorization (MF) and PageRank (PR). The experimental results show that, compared with all the baseline algorithms (i.e., standard BSP, ASP, and SSP), SmartPS reduces the overall training time by 65.7%~84.9% while achieving the same training accuracy.
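
To make the abstract's three PS-side mechanisms concrete, here is a minimal, hypothetical Python sketch (not the authors' implementation; all class, method, worker, and block names are invented for illustration). It shows a parameter server that (1) uses a dependency map to proactively push only the parameter blocks each worker needs, (2) serves the slowest workers first, and (3) drops pushes of blocks that no other worker will read in the next iteration.

from collections import defaultdict

class SmartPSSketch:
    """Hypothetical PS-side bookkeeping: dependency-aware proactive pushes,
    straggler-first scheduling, and blocking of redundant pushes."""

    def __init__(self, dependency):
        # dependency[block_id] -> set of worker ids that read this block in
        # the next iteration (the PS's "global view" of parameter dependency).
        self.dependency = defaultdict(set, dependency)
        self.params = {}                   # block_id -> latest value
        self.progress = defaultdict(int)   # worker_id -> finished iterations

    def record_progress(self, worker, iteration):
        # Called whenever a worker reports a completed iteration; lets the
        # PS estimate which workers are lagging behind.
        self.progress[worker] = max(self.progress[worker], iteration)

    def push(self, worker, block, value):
        # Third mechanism: opportunistically block a push if no other worker
        # depends on this block in the upcoming iteration.
        if self.dependency[block] - {worker}:
            self.params[block] = value
            return True
        return False

    def schedule_sends(self):
        # First and second mechanisms: proactively send each worker only the
        # blocks it depends on, serving the slowest workers (stragglers) first.
        stragglers_first = sorted(self.progress, key=self.progress.get)
        return [
            (w, sorted(b for b, readers in self.dependency.items() if w in readers))
            for w in stragglers_first
        ]

if __name__ == "__main__":
    ps = SmartPSSketch({"blk0": {"w0", "w1"}, "blk1": {"w1"}})
    ps.record_progress("w0", 3)
    ps.record_progress("w1", 1)          # w1 is the straggler
    print(ps.push("w0", "blk1", 0.5))    # True: w1 still needs blk1
    print(ps.push("w1", "blk1", 0.7))    # False: no other worker reads blk1
    print(ps.schedule_sends())           # w1 is scheduled before w0

In a real system the dependency map would come from the model's parameter-access pattern and the scheduling would be interleaved with network transmission, but this bookkeeping captures the essence of the PS-driven approach the abstract describes.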




      Published In

      APNet '19: Proceedings of the 3rd Asia-Pacific Workshop on Networking
      August 2019
      104 pages
      ISBN:9781450376358
      DOI:10.1145/3343180

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Distributed machine learning (DML)
      2. global view
      3. opportunistically block
      4. parameter dependency
      5. prioritize parameter transmission

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • The National Key Research and Development Program of China
      • The National Natural Science Foundation of China
      • The Research and Development Program in Key Areas of Guangdong Province

      Conference

      APNet '19

      Acceptance Rates

      Overall acceptance rate: 50 of 118 submissions (42%)

      Cited By

      • (2024) A high-performance dataflow-centric optimization framework for deep learning inference on the edge. Journal of Systems Architecture, 152, 103180. DOI: 10.1016/j.sysarc.2024.103180. Online publication date: Jul 2024.
      • (2023) Embracing Uncertainty for Equity in Resource Allocation in ML Training. Proceedings of the 52nd International Conference on Parallel Processing, 423-432. DOI: 10.1145/3605573.3605583. Online publication date: 7 Aug 2023.
      • (2023) Offloading Machine Learning to Programmable Data Planes: A Systematic Survey. ACM Computing Surveys, 56(1), 1-34. DOI: 10.1145/3605153. Online publication date: 26 Aug 2023.
      • (2023) Tree-Based Elastic Parameter Server to Schedule Resources to Accelerate Distributed Training. 2023 IEEE 11th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), 379-382. DOI: 10.1109/ITAIC58329.2023.10408975. Online publication date: 8 Dec 2023.
      • (2022) GSSP: Eliminating Stragglers Through Grouping Synchronous for Distributed Deep Learning in Heterogeneous Cluster. IEEE Transactions on Cloud Computing, 10(4), 2637-2648. DOI: 10.1109/TCC.2021.3062398. Online publication date: 1 Oct 2022.
      • (2021) DQ-DPS Data Partition Strategy Based on Distributed Machine Learning. Proceedings of the 2021 2nd International Conference on Artificial Intelligence in Electronics Engineering, 20-26. DOI: 10.1145/3460268.3460272. Online publication date: 15 Jan 2021.
      • (2021) H-PS: A Heterogeneous-Aware Parameter Server With Distributed Neural Network Training. IEEE Access, 9, 44049-44058. DOI: 10.1109/ACCESS.2021.3060154. Online publication date: 2021.
      • (2020) Elastic parameter server load distribution in deep learning clusters. Proceedings of the 11th ACM Symposium on Cloud Computing, 507-521. DOI: 10.1145/3419111.3421307. Online publication date: 12 Oct 2020.
      • (2020) Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning. IEEE Transactions on Wireless Communications, 19(12), 8272-8286. DOI: 10.1109/TWC.2020.3021177. Online publication date: 1 Dec 2020.
      • (2020) Online Resource Allocation With Machine Variability: A Bandit Perspective. IEEE/ACM Transactions on Networking, 28(5), 2243-2256. DOI: 10.1109/TNET.2020.3006906. Online publication date: Oct 2020.
