Learnae: Distributed and Resilient Deep Neural Network Training for Heterogeneous Peer to Peer Topologies

  • Conference paper
Engineering Applications of Neural Networks (EANN 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1000)

Abstract

Learnae is a framework proposal for the decentralized training of Deep Neural Networks (DNNs). Its main priority is to maintain a fully distributed architecture in which no participant has any kind of coordinating role. This strict peer-to-peer concept covers all aspects: the underlying network protocols, data acquisition and distribution, and model training. The result is a resilient DNN training system with no single point of failure. Learnae focuses on use cases where infrastructure heterogeneity and network unreliability result in a constantly changing environment of commodity-hardware nodes. Achieving this level of decentralization required new technologies; the main pillars of this implementation are the ongoing IPFS and IOTA projects. IPFS is a platform for a purely decentralized filesystem, in which each node contributes local data storage. IOTA aims to be the networking infrastructure of the emerging Internet of Things. On top of these, we propose a management algorithm for training a DNN model collaboratively through the optimal exchange of data and model weights, always using distribution-friendly gossip protocols.
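The abstract does not reproduce Learnae's exact update rule, but the coordinator-free exchange it describes can be illustrated with a minimal sketch. The Python fragment below (all names, such as Peer, local_step, mix and gossip_round, are hypothetical and not part of Learnae) shows the generic gossip-averaging pattern: each node takes ordinary SGD steps on its local data and periodically averages its weights with a randomly chosen neighbour, so no node ever acts as a parameter server.

import numpy as np

class Peer:
    """One training node holding a private copy of the model weights."""
    def __init__(self, weights):
        self.weights = [w.copy() for w in weights]

    def local_step(self, grads, lr=0.01):
        # Plain SGD on gradients computed from this node's local data.
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]

    def mix(self, other_weights, mix_rate=0.5):
        # Gossip averaging: move part-way toward a neighbour's weights.
        self.weights = [(1.0 - mix_rate) * w + mix_rate * o
                        for w, o in zip(self.weights, other_weights)]

def gossip_round(peers, rng):
    """One gossip round: every peer averages with one random neighbour."""
    for peer in peers:
        neighbour = peers[rng.integers(len(peers))]
        if neighbour is not peer:
            peer.mix(neighbour.weights)

# Toy usage: eight peers, one gossip round on a two-tensor "model".
rng = np.random.default_rng(0)
init = [np.zeros((4, 4)), np.zeros(4)]
peers = [Peer(init) for _ in range(8)]
gossip_round(peers, rng)

In Learnae the exchanged weights would travel over IPFS and IOTA rather than by direct in-memory copy, and the mixing schedule is the subject of the proposed management algorithm; this sketch only fixes the gossip-averaging idea under those stated assumptions.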



Acknowledgements

This research is funded by the University of Macedonia Research Committee as part of the “Principal Research 2019” funding program.

Author information

Corresponding author

Correspondence to Spyridon Nikolaidis.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Nikolaidis, S., Refanidis, I. (2019). Learnae: Distributed and Resilient Deep Neural Network Training for Heterogeneous Peer to Peer Topologies. In: Macintyre, J., Iliadis, L., Maglogiannis, I., Jayne, C. (eds) Engineering Applications of Neural Networks. EANN 2019. Communications in Computer and Information Science, vol 1000. Springer, Cham. https://doi.org/10.1007/978-3-030-20257-6_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-20257-6_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20256-9

  • Online ISBN: 978-3-030-20257-6

  • eBook Packages: Computer Science, Computer Science (R0)
