Learnae: Distributed and Resilient Deep Neural Network Training for Heterogeneous Peer to Peer Topologies

  • Conference paper
Engineering Applications of Neural Networks (EANN 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1000)

Abstract

Learnae is a framework proposal for the decentralized training of Deep Neural Networks (DNNs). Its main priority is to maintain a fully distributed architecture in which no participant has any kind of coordinating role. This strict peer-to-peer concept covers all aspects: the underlying network protocols, data acquisition and distribution, and model training. The result is a resilient DNN training system with no single point of failure. Learnae focuses on use cases where infrastructure heterogeneity and network unreliability result in a constantly changing environment of commodity-hardware nodes. Achieving this level of decentralization required new technologies; the main pillars of this implementation are the ongoing IPFS and IOTA projects. IPFS is a platform for a purely decentralized filesystem, in which each node contributes local data storage. IOTA aims to be the networking infrastructure of the emerging Internet of Things. On top of these, we propose a management algorithm for training a DNN model collaboratively through the optimal exchange of data and model weights, always using distribution-friendly gossip protocols.
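The abstract does not reproduce Learnae's exact update rule, but the coordinator-free exchange it describes can be illustrated with a minimal sketch. The Python fragment below (all names, such as Peer, local_step, mix and gossip_round, are hypothetical and not part of Learnae) shows the generic gossip-averaging pattern: each node takes ordinary SGD steps on its local data and periodically averages its weights with a randomly chosen neighbour, so no node ever acts as a parameter server.

import numpy as np

class Peer:
    """One training node holding a private copy of the model weights."""
    def __init__(self, weights):
        self.weights = [w.copy() for w in weights]

    def local_step(self, grads, lr=0.01):
        # Plain SGD on gradients computed from this node's local data.
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]

    def mix(self, other_weights, mix_rate=0.5):
        # Gossip averaging: move part-way toward a neighbour's weights.
        self.weights = [(1.0 - mix_rate) * w + mix_rate * o
                        for w, o in zip(self.weights, other_weights)]

def gossip_round(peers, rng):
    """One gossip round: every peer averages with one random neighbour."""
    for peer in peers:
        neighbour = peers[rng.integers(len(peers))]
        if neighbour is not peer:
            peer.mix(neighbour.weights)

# Toy usage: eight peers, one gossip round on a two-tensor "model".
rng = np.random.default_rng(0)
init = [np.zeros((4, 4)), np.zeros(4)]
peers = [Peer(init) for _ in range(8)]
gossip_round(peers, rng)

In Learnae the exchanged weights would travel over IPFS and IOTA rather than by direct in-memory copy, and the mixing schedule is the subject of the proposed management algorithm; this sketch only fixes the gossip-averaging idea under those stated assumptions.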



Acknowledgements

This research is funded by the University of Macedonia Research Committee as part of the “Principal Research 2019” funding program.

Author information

Corresponding author

Correspondence to Spyridon Nikolaidis.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Nikolaidis, S., Refanidis, I. (2019). Learnae: Distributed and Resilient Deep Neural Network Training for Heterogeneous Peer to Peer Topologies. In: Macintyre, J., Iliadis, L., Maglogiannis, I., Jayne, C. (eds) Engineering Applications of Neural Networks. EANN 2019. Communications in Computer and Information Science, vol 1000. Springer, Cham. https://doi.org/10.1007/978-3-030-20257-6_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-20257-6_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20256-9

  • Online ISBN: 978-3-030-20257-6

  • eBook Packages: Computer Science, Computer Science (R0)
