DOI: 10.1145/3491087.3493678 · Middleware Conference Proceedings · Short paper

Non-relational multi-level caching for mitigation of staleness & stragglers in distributed deep learning

Published: 06 December 2021

ABSTRACT

For efficient distributed deep neural network training, stale gradients and stragglers must be mitigated. The stale-gradient problem arises when a deep neural network is distributed and parallelized across multiple clusters or nodes. The proposed solution for stragglers is to use a distributed non-relational database to record the intermediate weight updates and their respective nodes. The results from the database are fed to the parameter server. If a delay in the parameter data due to straggling is detected, the straggled work is immediately re-provisioned on another node as a serverless function. In this approach, each node is equipped with a distributed in-memory cache, and a non-relational database resides at the parameter server. The parameter server is an intelligent node that uses a runtime threshold and error analysis to fix the optimal value of 'K' for K-SGD. The proposed solution for stale data is to efficiently utilize multiple GPUs with multiple levels of caching in the cloud for better performance and reduced response time. Response time is reduced by offloading and pushing data close to the nodes across multiple levels of distributed cache. Using the GPU cache and Elastic Cache, data is updated to the individual nodes at optimal time intervals. The result is an integrated solution for stragglers and staleness in both data-parallel and model-parallel distributed deep learning networks.
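As a rough illustration of the mechanism described above, the sketch below shows how a parameter server might cache intermediate gradients in a key-value store, apply a K-SGD update once the fastest K workers have reported, and flag the remaining workers as stragglers once a runtime threshold elapses. This is a minimal sketch under assumed names (ParameterServerSketch, push_gradient, try_update) with a plain Python dictionary standing in for the distributed non-relational database and caches; it is not the authors' implementation.

```python
# Minimal sketch of threshold-based K-SGD aggregation with straggler flagging.
# The kv_store dict is only a stand-in for the distributed non-relational
# database / multi-level cache described in the abstract.

import time
import numpy as np


class ParameterServerSketch:
    def __init__(self, num_workers, k, timeout_s, lr=0.01):
        self.num_workers = num_workers
        self.k = k                      # K of K-SGD: gradients needed per step
        self.timeout_s = timeout_s      # runtime threshold for straggler detection
        self.lr = lr
        self.kv_store = {}              # stand-in for the non-relational database
        self.step_start = time.monotonic()

    def push_gradient(self, worker_id, step, grad):
        """A worker pushes its gradient; it is cached under a (step, worker) key."""
        self.kv_store[(step, worker_id)] = np.asarray(grad, dtype=float)

    def try_update(self, params, step):
        """Apply an update if K gradients arrived, or report stragglers on timeout."""
        received = {w: g for (s, w), g in self.kv_store.items() if s == step}
        if len(received) >= self.k:
            # Average the fastest K gradients and take one SGD step.
            grads = list(received.values())[: self.k]
            new_params = params - self.lr * np.mean(grads, axis=0)
            return new_params, []
        if time.monotonic() - self.step_start > self.timeout_s:
            # Workers that missed the runtime threshold are stragglers; their
            # work would be re-provisioned elsewhere (e.g. as a serverless function).
            stragglers = [w for w in range(self.num_workers) if w not in received]
            return params, stragglers
        return params, []


if __name__ == "__main__":
    ps = ParameterServerSketch(num_workers=4, k=2, timeout_s=5.0)
    params = np.zeros(3)
    ps.push_gradient(worker_id=0, step=0, grad=[0.1, 0.2, 0.3])
    ps.push_gradient(worker_id=1, step=0, grad=[0.3, 0.2, 0.1])
    params, stragglers = ps.try_update(params, step=0)
    print(params, stragglers)   # updated parameters; no stragglers flagged yet
```

In the paper's setting, the straggler list returned by try_update would drive re-provisioning of the straggled work on another node, and the key-value store would be a distributed cache pushed close to the nodes rather than local memory.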


