Impact of Synchronization Topology on DML Performance: Both Logical Topology and Physical Topology

Abstract:

To handle increasingly large training data and models, researchers and engineers resort to multiple servers in a data center for distributed machine learning (DML). On one hand, DML lets us leverage the computation power of multiple servers, which can effectively accelerate computation-intensive tasks. On the other hand, DML also incurs significant communication cost due to parameter synchronization among these servers. In this paper, we explore the impact of synchronization topology, including both logical topology and physical topology, on DML performance. First, we revisit the existing logical topologies for parameter synchronization, e.g., parameter server and ring allreduce, and we find that these flat synchronization topologies are inefficient when running large-scale DML training. Therefore, we propose a hierarchical parameter synchronization topology, called HiPS, which can achieve efficient parameter synchronization even at large scale. Then, we compare two representative physical network topologies, namely, Fat-Tree and BCube. Based on our analyses, BCube has many advantages over Fat-Tree, e.g., higher bandwidth, better load balance, and lower hardware cost. The simulation results also show that BCube is more friendly to RDMA. Relying on the advantages of HiPS and BCube, the GST of “HiPS+BCube” is 12% to 70% lower than that of other combinations. Moreover, when the cluster size increases from 16 to 1024, the performance of “HiPS+BCube” drops by only 6.5%, while the performance of “Ring+BCube” drops by 44.6%. Hence, we believe “HiPS+BCube” is the optimal solution to benefit DML at large scale.
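For readers unfamiliar with the flat ring allreduce topology mentioned in the abstract, the following is a minimal single-process sketch of the textbook algorithm (reduce-scatter followed by all-gather). It is not the paper's HiPS design or its simulator; the function name `ring_allreduce` and the list-based "workers" are illustrative assumptions. It shows that a ring over N workers needs 2(N-1) sequential communication steps, so the step count grows linearly with cluster size.

```python
def ring_allreduce(vectors):
    """Simulate a sum ring allreduce over n in-memory 'workers'.

    Hypothetical helper for illustration only. Returns (per-worker results,
    number of sequential communication steps).
    """
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "vector length must be divisible by worker count"
    chunk = size // n
    data = [list(v) for v in vectors]  # each worker's local copy
    steps = 0

    def sl(c):
        # Slice covering chunk index c
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at step s, worker i sends chunk (i - s) mod n to its
    # right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, data[i][sl((i - s) % n)]) for i in range(n)]
        for i, c, payload in sends:
            dst = (i + 1) % n
            data[dst][sl(c)] = [a + b for a, b in zip(data[dst][sl(c)], payload)]
        steps += 1

    # Now worker i holds the fully reduced chunk (i + 1) mod n.
    # All-gather: at step s, worker i forwards chunk (i + 1 - s) mod n.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, data[i][sl((i + 1 - s) % n)]) for i in range(n)]
        for i, c, payload in sends:
            data[(i + 1) % n][sl(c)] = payload
        steps += 1

    return data, steps


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40],
             [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
    results, steps = ring_allreduce(grads)
    print(results[0], steps)
```

With four workers the sketch takes 6 steps; under the same model, 1024 workers would take 2046 sequential steps, which gives some intuition for why a flat ring can become slow at large scale.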
Published in: IEEE/ACM Transactions on Networking (Volume: 30, Issue: 2, April 2022)
Page(s): 572 - 585
Date of Publication: 08 October 2021
