DOI: 10.1145/3587716.3587731

Baileys: An Efficient Distributed Machine Learning Framework by Dynamic Grouping

Published: 07 September 2023

Abstract

Many machine-learning applications rely on distributed machine learning (DML) systems to train models from massive datasets using massive computing resources (e.g., GPUs and TPUs). However, in most applications the parameter synchronization mechanism of a DML system is fixed and independent of the types and amounts of resources available to it. In this paper, we argue that, given an application, the synchronization mechanism in its DML system should be co-designed with the resources available in the heterogeneous cluster. To this end, we design an efficient parameter synchronization framework called Baileys. First, Baileys retrieves resource information from the heterogeneous cluster and uses this information to partition heterogeneous workers into multiple groups dynamically. Second, Baileys provides efficient group-based parameter synchronization mechanisms so that model parameters converge quickly and accurately. We implement a prototype of Baileys and demonstrate its efficiency and efficacy through experiments. Results show that Baileys can reduce block time by up to 33.2% and decrease test error by up to 39.7%.
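The abstract only sketches how dynamic grouping might work, so the snippet below illustrates one plausible way to partition heterogeneous workers by profiled throughput so that no group mixes workers of very different speeds. This is a minimal sketch under assumed details: the Worker class, the group_workers function, and the 1.5x speed-ratio threshold are illustrative assumptions, not Baileys' actual grouping rule.

```python
# Illustrative sketch only: the paper does not describe Baileys' grouping rule here,
# so this assumes workers are grouped by measured per-iteration throughput.
# All names (Worker, group_workers, iterations_per_sec) are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    worker_id: int
    iterations_per_sec: float  # profiled compute speed of this worker


def group_workers(workers: List[Worker], ratio: float = 1.5) -> List[List[Worker]]:
    """Partition workers so that, within a group, the fastest worker is at most
    `ratio` times faster than the slowest, limiting intra-group straggler delay."""
    ordered = sorted(workers, key=lambda w: w.iterations_per_sec)
    groups: List[List[Worker]] = []
    current: List[Worker] = []
    for w in ordered:
        # Close the current group when the next worker is too fast relative
        # to the slowest worker already in it.
        if current and w.iterations_per_sec > ratio * current[0].iterations_per_sec:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    cluster = [Worker(i, speed) for i, speed in enumerate([10.0, 9.5, 4.8, 5.1, 1.2])]
    for g in group_workers(cluster):
        print([w.worker_id for w in g], [w.iterations_per_sec for w in g])
```

With such a partition, a group-based scheme could synchronize tightly within each group (where speeds are similar) and more loosely across groups; this is one common way group-based synchronization is organized and is offered only as context for the abstract's description.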


Published In

ICMLC '23: Proceedings of the 2023 15th International Conference on Machine Learning and Computing
February 2023
619 pages
ISBN: 9781450398411
DOI: 10.1145/3587716

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Distributed machine learning
  2. Dynamic grouping
  3. Multi-server architecture
  4. System heterogeneity

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMLC 2023
