Abstract
TensorFlow is an open-source software library for deep learning based on dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. Distributed TensorFlow uses gRPC to communicate between nodes; however, when training tasks are deployed on high-performance computing clusters, gRPC becomes a performance bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to a traditional TCP/IP network, but open-source TensorFlow has not taken advantage of this. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive path of TensorFlow to RDMA verbs, we obtain nearly 6\(\times \) performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also shows good scalability as the training scale grows.
Acknowledgements
This research was conducted at the Advanced Computer System Architecture (ACSA) Laboratory of the University of Science and Technology of China, and was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403).
Jia, C., Liu, J., Jin, X. et al. Improving the Performance of Distributed TensorFlow with RDMA. Int J Parallel Prog 46, 674–685 (2018). https://doi.org/10.1007/s10766-017-0520-3