ABSTRACT
Efficient distributed deep neural network training requires mitigating stale gradients and stragglers. The stale gradient problem arises when a deep neural network is distributed and parallelized across multiple clusters or nodes. The proposed solution for stragglers is to use a distributed non-relational database to record the intermediate weight results along with their respective nodes. The results from the database are supplied to the parameter server. If a delay in the parameter data due to straggling is detected, the straggled work is immediately reconfigured on another node as a serverless function. In this approach, each node is equipped with a distributed in-memory cache, and a non-relational database resides at the parameter server. The parameter server acts as an intelligent node, using a runtime threshold and error analysis to fix the optimal value of K for K-SGD. The proposed solution for stale data is to efficiently utilize multiple GPUs with multiple levels of caching in the cloud for better performance and reduced response time. Response time is reduced by offloading and pushing data close to the nodes through multiple levels of distributed cache. Using the GPU cache and Elastic Cache, data is propagated to the individual nodes at optimal time intervals. Together, these techniques form an integrated solution for stragglers and staleness in both data-parallel and model-parallel distributed deep learning.
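The sketch below illustrates the straggler-handling idea described above: a parameter server that waits for K of N gradients per step, persists intermediate per-node results in a key-value store standing in for the non-relational database, and re-schedules nodes that exceed a runtime threshold. It is a minimal illustration, not the authors' implementation; the names `poll_worker`, `relaunch_as_serverless`, and `kv_store` are hypothetical placeholders.

```python
# Minimal K-of-N parameter-server sketch (illustrative only, not the paper's code).
import time
import numpy as np


class KSGDParameterServer:
    def __init__(self, model_dim, num_workers, k, lr=0.01, straggler_timeout=2.0):
        self.weights = np.zeros(model_dim)          # global model parameters
        self.num_workers = num_workers
        self.k = k                                  # wait for K of N gradients per step
        self.lr = lr
        self.straggler_timeout = straggler_timeout  # runtime threshold for straggler detection
        self.kv_store = {}                          # stand-in for the non-relational DB of
                                                    # intermediate (node_id -> gradient) results

    def collect_round(self, poll_worker, relaunch_as_serverless):
        """Gather gradients until K arrive or the runtime threshold expires.

        poll_worker(i) returns a gradient array or None if node i has not finished;
        relaunch_as_serverless(i) re-runs a straggler's shard elsewhere (hypothetical hook).
        """
        start = time.time()
        pending = set(range(self.num_workers))
        while len(self.kv_store) < self.k and pending:
            for i in list(pending):
                grad = poll_worker(i)
                if grad is not None:
                    self.kv_store[i] = grad         # persist the intermediate result per node
                    pending.discard(i)
            if time.time() - start > self.straggler_timeout:
                for i in pending:                   # nodes still missing are stragglers:
                    relaunch_as_serverless(i)       # re-schedule their work as a serverless fn
                break
            time.sleep(0.01)

    def apply_update(self):
        """Average whatever gradients arrived this round and take one SGD step."""
        if not self.kv_store:
            return
        avg_grad = np.mean(list(self.kv_store.values()), axis=0)
        self.weights -= self.lr * avg_grad
        self.kv_store.clear()
```

For the staleness side, the abstract's multi-level caching can be pictured as a two-tier lookup: a node-local in-memory/GPU-side cache backed by a shared distributed cache, with the parameter server as the source of truth. Again this is a hedged sketch; `elastic_cache` and `fetch_from_parameter_server` are assumed interfaces, not a specific product API.

```python
# Two-level cache lookup sketch: node-local dict backed by a shared cache client.
def fetch_weights(key, local_cache, elastic_cache, fetch_from_parameter_server):
    """Return the freshest copy of `key`, pulling it closer to the node on a miss."""
    if key in local_cache:                        # level 1: node-local cache hit
        return local_cache[key]
    value = elastic_cache.get(key)                # level 2: shared distributed cache
    if value is None:
        value = fetch_from_parameter_server(key)  # fall back to the parameter server
        elastic_cache.set(key, value)             # push the data close to the nodes
    local_cache[key] = value
    return value
```

The design intent in both sketches is the same as in the abstract: keep frequently needed parameters near the workers and bound how long any single slow node can delay a synchronization step.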
Index Terms
- Non-relational multi-level caching for mitigation of staleness & stragglers in distributed deep learning