A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning

Abstract:

In large-scale distributed machine learning (DML), network performance between machines significantly affects the speed of iterative training. In this paper we propose BML, a scalable, high-performance, and fault-tolerant DML network architecture built on Ethernet and commodity devices. BML builds on the BCube topology and runs a fully distributed gradient synchronization algorithm. Compared with a Fat-Tree network of the same size, a BML network is expected to take much less time for gradient synchronization, owing both to its low theoretical synchronization time and to its benefit to RDMA transport. Under server or link failures, the performance of BML degrades gracefully. Experiments with the MNIST and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4% compared with Fat-Tree running a state-of-the-art gradient synchronization algorithm.

Published in: IEEE/ACM Transactions on Networking ( Volume: 28, Issue: 4, August 2020)

Page(s): 1752 - 1764

Date of Publication: 19 June 2020

DOI: 10.1109/TNET.2020.2999377
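
The BCube topology named in the abstract can be illustrated with a minimal sketch of the standard BCube(n, k) addressing scheme. This is not the paper's gradient synchronization algorithm; the function names are hypothetical, and the parameters n = 3, k = 1 are an assumption, chosen only because they yield 9 servers, the testbed size the abstract reports.

```python
from itertools import product


def bcube_servers(n, k):
    """Return all server addresses of BCube(n, k) as (k+1)-digit base-n tuples."""
    return list(product(range(n), repeat=k + 1))


def level_neighbors(server, level, n):
    """Return the servers attached to the same switch as `server` at `level`.

    Two servers share a switch at a given level exactly when their addresses
    differ only in that digit. Here `level` is a position in the tuple.
    """
    return [
        server[:level] + (digit,) + server[level + 1:]
        for digit in range(n)
        if digit != server[level]
    ]


def switch_count(n, k):
    """BCube(n, k) has k+1 levels with n**k switches of n ports each."""
    return (k + 1) * n ** k


if __name__ == "__main__":
    n, k = 3, 1  # assumed parameters, not stated in the abstract
    servers = bcube_servers(n, k)
    print(len(servers), "servers,", switch_count(n, k), "switches")
    for level in range(k + 1):
        print("level", level, "neighbors of (0, 0):",
              level_neighbors((0, 0), level, n))
```

Under these assumed parameters each server has k + 1 = 2 network ports and reaches (k + 1)(n - 1) = 4 other servers through a single switch hop. That multi-port, server-centric structure is what distinguishes BCube from a Fat-Tree of the same size.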
