
Improving the Performance of Distributed TensorFlow with RDMA

International Journal of Parallel Programming

Abstract

TensorFlow is an open-source software library for deep learning based on dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device through a single API. In distributed mode, TensorFlow uses gRPC to communicate between nodes. However, when training tasks are deployed on high performance computing clusters, gRPC becomes a performance bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to the traditional TCP/IP network, but open-source TensorFlow does not take advantage of it. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive path of TensorFlow to RDMA verbs, we achieve nearly a 6\(\times \) performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also scales well as the training cluster grows.
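To make the approach concrete, the following minimal C sketch illustrates the generic libibverbs pattern behind an RDMA tensor transfer: register the tensor buffer as a memory region, create a reliable-connected queue pair, post a one-sided RDMA write, and poll the completion queue. This is not the paper's implementation; the buffer size, the remote_addr/remote_rkey placeholders, and the omitted queue-pair connection handshake (which a real design would perform over an out-of-band control channel) are assumptions for illustration only.

    /*
     * Minimal sketch (not the authors' implementation): register a tensor
     * buffer with the RDMA NIC and post an RDMA_WRITE with libibverbs.
     * Queue-pair state transitions and the exchange of remote_addr/rkey
     * with the peer are omitted and stubbed with hypothetical values.
     *
     * Build (assumption): gcc rdma_sketch.c -libverbs
     */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        int num_devices = 0;
        struct ibv_device **devs = ibv_get_device_list(&num_devices);
        if (!devs || num_devices == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

        /* Pin and register the tensor buffer so the NIC can DMA it directly,
           bypassing the serialize/copy path taken by gRPC. */
        size_t len = 4 * 1024 * 1024;            /* e.g., a 4 MB tensor */
        void *tensor = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, tensor, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);

        /* Reliable-connected queue pair; moving it to RTR/RTS and exchanging
           QP numbers with the peer are omitted in this sketch. */
        struct ibv_qp_init_attr qpa;
        memset(&qpa, 0, sizeof(qpa));
        qpa.send_cq = cq;
        qpa.recv_cq = cq;
        qpa.qp_type = IBV_QPT_RC;
        qpa.cap.max_send_wr = 16;
        qpa.cap.max_recv_wr = 16;
        qpa.cap.max_send_sge = 1;
        qpa.cap.max_recv_sge = 1;
        struct ibv_qp *qp = ibv_create_qp(pd, &qpa);

        /* Hypothetical peer info, normally learned via the control channel. */
        uint64_t remote_addr = 0;
        uint32_t remote_rkey = 0;

        struct ibv_sge sge = { .addr = (uintptr_t)tensor,
                               .length = (uint32_t)len,
                               .lkey = mr->lkey };
        struct ibv_send_wr wr, *bad = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id = 1;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.opcode = IBV_WR_RDMA_WRITE;           /* one-sided write of the tensor */
        wr.send_flags = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey = remote_rkey;

        if (ibv_post_send(qp, &wr, &bad) == 0) {
            struct ibv_wc wc;
            while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin until completion */ }
        }

        ibv_destroy_qp(qp);
        ibv_dereg_mr(mr);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(tensor);
        return 0;
    }

Because the NIC writes directly into pre-registered remote memory, the receiver's CPU is not involved in the data movement, which is the main source of the speedup over gRPC's copy-and-serialize path.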





Acknowledgements

This research was conducted in the Advanced Computer System Architecture (ACSA) Laboratory of the University of Science and Technology of China, supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403).

Author information


Corresponding author

Correspondence to Chengfan Jia.


About this article


Cite this article

Jia, C., Liu, J., Jin, X. et al. Improving the Performance of Distributed TensorFlow with RDMA. Int J Parallel Prog 46, 674–685 (2018). https://doi.org/10.1007/s10766-017-0520-3

