Tree Gradient Coding Considering Communication Delays and Partial Stragglers | IEEE Conference Publication | IEEE Xplore


Abstract:

Training large machine learning models with distributed gradient descent faces two major problems: straggling workers and communication delays in transmitting the computed gradients. In the Tree Gradient Coding (TGC) architecture, the workers are arranged in a tree topology and the data partitions are redundantly assigned to them, which provides resilience to straggling workers. However, TGC does not account for communication delays or the partial straggling behavior of workers when distributing the computation load. This paper derives an expression for the computation load in TGC that accounts for both communication delays and partial stragglers. The proposed technique is implemented on cloud-based VMs, and experiments show a speedup of up to 23.95% over the traditional TGC scheme.
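
The idea of redundantly assigning data partitions so that training tolerates stragglers can be illustrated with a minimal sketch. The code below is not the TGC scheme or the computation-load expression from this paper, neither of which appears in the abstract. It assumes a flat (non-tree) cyclic assignment in which each worker holds `stragglers + 1` consecutive partitions, and the function names `cyclic_assignment` and `survives_any_stragglers` are hypothetical.

```python
from itertools import combinations


def cyclic_assignment(num_workers: int, stragglers: int) -> list[set[int]]:
    """Assumed flat scheme: worker i holds partitions i, i+1, ..., i+stragglers (mod n)."""
    return [
        {(i + j) % num_workers for j in range(stragglers + 1)}
        for i in range(num_workers)
    ]


def survives_any_stragglers(assignment: list[set[int]], stragglers: int) -> bool:
    """Check that every group of n - stragglers responding workers covers all partitions."""
    n = len(assignment)
    all_partitions = set(range(n))
    for alive in combinations(range(n), n - stragglers):
        covered = set().union(*(assignment[w] for w in alive))
        if covered != all_partitions:
            return False
    return True


# Illustrative values only: 5 workers, designed to tolerate 2 stragglers.
assignment = cyclic_assignment(num_workers=5, stragglers=2)
print(assignment)
print(survives_any_stragglers(assignment, stragglers=2))
```

Because each partition is stored on `stragglers + 1` workers, losing any `stragglers` of them still leaves at least one copy of every partition. The cost is that each worker computes on `stragglers + 1` partitions instead of one, which is the kind of computation load the paper tunes for communication delays and partial stragglers.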
Date of Conference: 09-13 June 2024
Date Added to IEEE Xplore: 20 August 2024
Conference Location: Denver, CO, USA
