DOI: 10.1145/3587716.3587731

Baileys: An Efficient Distributed Machine Learning Framework by Dynamic Grouping

Published: 07 September 2023

Abstract

Many machine-learning applications rely on distributed machine learning (DML) systems to train models from massive datasets using massive computing resources (e.g., GPUs and TPUs). However, in most applications the parameter synchronization mechanism of a DML system is fixed and independent of the types and amounts of resources available to it. In this paper, we argue that, given an application, the synchronization mechanism in its DML system should be co-designed with the resources available in the heterogeneous cluster. To this end, we design an efficient parameter synchronization framework called Baileys. First, Baileys retrieves resource information from the heterogeneous cluster and uses this information to partition heterogeneous workers into multiple groups dynamically. Second, Baileys provides efficient group-based parameter synchronization mechanisms so that model parameters converge quickly and accurately. We implement a prototype of Baileys and demonstrate its efficiency and efficacy through experiments. Results show that Baileys can reduce block time by up to 33.2% and decrease test error by up to 39.7%.
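The abstract only sketches how dynamic grouping might work, so the snippet below illustrates one plausible way to partition heterogeneous workers by profiled throughput so that no group mixes workers of very different speeds. This is a minimal sketch under assumed details: the Worker class, the group_workers function, and the 1.5x speed-ratio threshold are illustrative assumptions, not Baileys' actual grouping rule.

```python
# Illustrative sketch only: the paper does not describe Baileys' grouping rule here,
# so this assumes workers are grouped by measured per-iteration throughput.
# All names (Worker, group_workers, iterations_per_sec) are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    worker_id: int
    iterations_per_sec: float  # profiled compute speed of this worker


def group_workers(workers: List[Worker], ratio: float = 1.5) -> List[List[Worker]]:
    """Partition workers so that, within a group, the fastest worker is at most
    `ratio` times faster than the slowest, limiting intra-group straggler delay."""
    ordered = sorted(workers, key=lambda w: w.iterations_per_sec)
    groups: List[List[Worker]] = []
    current: List[Worker] = []
    for w in ordered:
        # Close the current group when the next worker is too fast relative
        # to the slowest worker already in it.
        if current and w.iterations_per_sec > ratio * current[0].iterations_per_sec:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    cluster = [Worker(i, speed) for i, speed in enumerate([10.0, 9.5, 4.8, 5.1, 1.2])]
    for g in group_workers(cluster):
        print([w.worker_id for w in g], [w.iterations_per_sec for w in g])
```

With such a partition, a group-based scheme could synchronize tightly within each group (where speeds are similar) and more loosely across groups; this is one common way group-based synchronization is organized and is offered only as context for the abstract's description.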


Published In

ICMLC '23: Proceedings of the 2023 15th International Conference on Machine Learning and Computing
February 2023
619 pages
ISBN: 9781450398411
DOI: 10.1145/3587716

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Distributed machine learning
  2. Dynamic grouping
  3. Multi-server architecture
  4. System heterogeneity

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMLC 2023
