DOI: 10.1145/3302424.3303975

Fast Distributed Deep Learning over RDMA

Published: 25 March 2019

ABSTRACT

Deep learning has emerged as an important, resource-intensive workload and has been applied successfully in computer vision, speech recognition, natural language processing, and other domains. Distributed deep learning is becoming a necessity to cope with growing data and model sizes. Its computation is typically characterized by a simple tensor data abstraction to model multi-dimensional matrices, a dataflow graph to model computation, and iterative execution with relatively frequent synchronization, making it substantially different from Map/Reduce-style distributed big-data computation.

RPC, commonly used as the communication primitive, has been adopted by popular deep learning frameworks such as TensorFlow, which uses gRPC. We show that RPC is suboptimal for distributed deep learning computation, especially on an RDMA-capable network. The tensor abstraction and dataflow graph, coupled with an RDMA network, offer an opportunity to reduce unnecessary overhead (e.g., memory copies) without sacrificing programmability or generality. In particular, from a data-access point of view, a remote machine is abstracted simply as a "device" on an RDMA channel, with a plain memory interface for allocating, reading, and writing memory regions. Our graph analyzer examines both the dataflow graph and the tensors to optimize memory allocation and remote data access through this interface. The result is up to a 169% improvement over an RPC implementation optimized for RDMA, leading to faster convergence in the training process.
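To make the "remote machine as a device" idea concrete, below is a minimal C++ sketch of such a memory interface. All names here (RemoteDevice, RemoteRegion, MockDevice, Allocate/Read/Write) are hypothetical and not taken from the paper; a real implementation would issue one-sided RDMA verbs over a registered channel, whereas the mock simply simulates remote memory with a local buffer so the interface can be exercised.

```cpp
// Sketch, not the paper's implementation: a remote machine exposed as a
// "device" behind a simple memory interface for allocating, reading, and
// writing memory regions.
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Handle to a remotely accessible memory region.
struct RemoteRegion {
    uint64_t addr;  // address/offset on the remote "device"
    size_t   size;  // region length in bytes
};

// The "device" abstraction: allocate, read, and write remote memory regions.
class RemoteDevice {
public:
    virtual ~RemoteDevice() = default;
    virtual RemoteRegion Allocate(size_t size) = 0;
    // One-sided read: copy region contents into a local buffer.
    virtual void Read(const RemoteRegion& src, void* dst) = 0;
    // One-sided write: place bytes directly into the remote region,
    // avoiding an extra copy on the receiver.
    virtual void Write(const RemoteRegion& dst, const void* src) = 0;
};

// Mock standing in for an RDMA-backed implementation.
class MockDevice : public RemoteDevice {
public:
    RemoteRegion Allocate(size_t size) override {
        uint64_t addr = mem_.size();
        mem_.resize(mem_.size() + size);
        return {addr, size};
    }
    void Read(const RemoteRegion& src, void* dst) override {
        std::memcpy(dst, mem_.data() + src.addr, src.size);
    }
    void Write(const RemoteRegion& dst, const void* src) override {
        std::memcpy(mem_.data() + dst.addr, src, dst.size);
    }
private:
    std::vector<uint8_t> mem_;  // simulated remote memory
};

int main() {
    MockDevice dev;
    float tensor[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    // Sender writes a tensor straight into a pre-allocated remote region.
    RemoteRegion r = dev.Allocate(sizeof(tensor));
    dev.Write(r, tensor);
    // Receiver reads it back with no intermediate serialization step.
    float out[4];
    dev.Read(r, out);
    assert(out[2] == 3.0f);
    std::cout << "tensor transferred: " << out[0] << " ... " << out[3] << "\n";
}
```

Under this (assumed) interface, a graph analyzer of the kind the abstract describes could pre-allocate the destination region for a tensor edge that crosses machines and rewrite the transfer as a single RDMA write, eliminating the serialization and memory-copy steps an RPC path would incur.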


Published in

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
March 2019, 714 pages
ISBN: 9781450362818
DOI: 10.1145/3302424

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States



        Qualifiers

        • research-article
        • Research
        • Refereed limited

Acceptance Rates

Overall Acceptance Rate: 241 of 1,308 submissions, 18%
