
Improving the Performance of Distributed MXNet with RDMA

International Journal of Parallel Programming

Abstract

As one of the most influential deep learning frameworks, MXNet has achieved excellent performance and many breakthroughs in academic and industrial fields across a wide range of machine learning workloads. The original implementation of MXNet uses a proxy-socket interface, which delivers suboptimal performance in distributed environments. In a massively parallel training task, parameters are updated frequently during each training loop, so network performance becomes the dominant factor in overall performance. Over the past decade, high-performance interconnects have employed remote direct memory access (RDMA) technology to provide excellent performance for numerous scientific domains. In this paper, we describe an efficient design that extends the open-source MXNet to make it RDMA-capable via RDMA-based parameter-server interfaces. With modest optimizations of memory usage and transmission overhead, RDMA-based MXNet achieves a substantial performance improvement over the original software. Our experiments reveal that, for the communication subsystem of MXNet, the new design achieves a 16x speedup (up to 21x at peak) over 1 Gigabit Ethernet (1GigE). For two training cases on MXNet, the optimized implementation gains 5x and 9x speedups, respectively. Compared with experiments over the IP-over-InfiniBand (IPoIB) protocol, it achieves nearly 30% performance improvement, as well as better scalability and alleviation of bottlenecks.
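The abstract describes replacing MXNet's socket-based parameter-server transport with RDMA. The core verbs-level building block of any such design is registering parameter buffers with the NIC so that peers can access them through one-sided operations without involving the remote CPU. The sketch below is our minimal illustration of that step, assuming libibverbs and a 1 MiB stand-in buffer; it is not the authors' implementation.

```c
/*
 * Minimal libibverbs sketch: register a parameter buffer for RDMA.
 * Illustrative only; buffer size, device choice, and access flags
 * are assumptions. Build: gcc rdma_reg.c -libverbs
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num_devices = 0;
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices || num_devices == 0) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    /* Open the first device and allocate a protection domain. */
    struct ibv_context *ctx = ibv_open_device(devices[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* A stand-in for a tensor of model parameters (1 MiB, assumed size). */
    size_t len = 1 << 20;
    void *params = malloc(len);

    /* Pin and register the buffer so the NIC can read/write it directly,
       bypassing the kernel and extra copies (the core of RDMA's benefit). */
    struct ibv_mr *mr = ibv_reg_mr(pd, params, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr failed\n");
        return 1;
    }

    /* A peer needs this address and rkey (exchanged out of band during
       connection setup) to issue one-sided RDMA reads/writes here. */
    printf("buffer addr=%p rkey=0x%x lkey=0x%x\n", params, mr->rkey, mr->lkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devices);
    free(params);
    return 0;
}
```

In a design like the one the abstract outlines, each worker and server would register its key-value buffers this way once at startup, then push and pull parameters via RDMA writes and reads rather than per-message socket sends, which is where the reported speedups over 1GigE and IPoIB would come from.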



Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403).

Author information


Corresponding author

Correspondence to Mingfan Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Cite this article

Li, M., Wen, K., Lin, H. et al. Improving the Performance of Distributed MXNet with RDMA. Int J Parallel Prog 47, 467–480 (2019). https://doi.org/10.1007/s10766-018-00623-w

