Abstract
TensorFlow is an open-source software library for deep learning based on dataflow graph computation. Thanks to its flexible architecture, users can deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. Distributed TensorFlow uses gRPC to communicate between nodes; however, when training tasks are deployed on high-performance computing clusters, gRPC becomes a performance bottleneck of the distributed TensorFlow system. HPC clusters are usually equipped with an InfiniBand network in addition to a traditional TCP/IP network, but open-source TensorFlow has not taken advantage of this. We present an RDMA-capable design of TensorFlow. By porting the tensor send/receive path of TensorFlow to RDMA verbs, we obtain nearly 6\(\times \) performance improvement over the original gRPC-based distributed TensorFlow. The TensorFlow system with RDMA support also shows good scalability as the training scale grows.
Acknowledgements
This research was conducted at the Advanced Computer System Architecture (ACSA) Laboratory of the University of Science and Technology of China, and was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403).
Jia, C., Liu, J., Jin, X. et al. Improving the Performance of Distributed TensorFlow with RDMA. Int J Parallel Prog 46, 674–685 (2018). https://doi.org/10.1007/s10766-017-0520-3