Abstract:
There are two major problems in training large machine learning models using distributed gradient descent. The first is the problem of straggling workers, and the second is the communication delay in transmitting the computed gradients. In the Tree Gradient Coding (TGC) architecture, the workers are arranged in a tree topology and the data partitions are redundantly assigned to these workers, providing resilience to straggling workers. However, TGC does not account for communication delays or the partial straggling behavior of the workers when distributing the computation load. In this paper, an expression for the computation load in TGC that accounts for communication delays and partial stragglers is derived. The proposed technique is implemented on cloud-based VMs, and experimental results show a speedup of up to 23.95% compared to the traditional TGC scheme.
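To illustrate the redundant-assignment idea the abstract refers to, the following is a minimal sketch of the classical cyclic gradient-coding placement, in which every data partition is replicated on enough workers that any fixed number of stragglers can be ignored. This is a generic illustration, not the paper's TGC scheme; the names `n_workers` and `n_stragglers` are illustrative assumptions.

```python
def cyclic_assignment(n_workers: int, n_stragglers: int):
    """Assign partition indices to workers so that every partition is held by
    (n_stragglers + 1) workers; the results of any n_stragglers workers can
    then be dropped without losing any partition's gradient."""
    redundancy = n_stragglers + 1
    assignment = {}
    for w in range(n_workers):
        # Worker w holds partitions w, w+1, ..., w+redundancy-1 (mod n_workers).
        assignment[w] = [(w + j) % n_workers for j in range(redundancy)]
    return assignment


if __name__ == "__main__":
    # Example: 6 workers, tolerate up to 2 stragglers -> each partition on 3 workers.
    for worker, parts in cyclic_assignment(6, 2).items():
        print(f"worker {worker}: partitions {parts}")
```

In a tree topology such as TGC, the same replication principle applies, but partitions and aggregation duties are distributed along the tree rather than over a flat set of workers.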
Date of Conference: 09-13 June 2024
Date Added to IEEE Xplore: 20 August 2024