
Improving the Performance of Distributed MXNet with RDMA

International Journal of Parallel Programming

Abstract

As one of the most influential deep learning frameworks, MXNet has achieved excellent performance and many breakthroughs in academic and industrial fields across a wide range of machine learning workloads. The original implementation of MXNet uses a proxy-socket interface, which delivers suboptimal performance in distributed environments. In a massively parallel training task, parameters are updated frequently during each training loop, so network performance becomes the dominant factor in overall performance. Over the past decade, high-performance interconnects have employed remote direct memory access (RDMA) technology to provide excellent performance for numerous scientific domains. In this paper, we describe an efficient design that extends the open-source MXNet to make it RDMA-capable via RDMA-based parameter-server interfaces. With modest optimizations of memory usage and transmission overhead, RDMA-based MXNet achieves a substantial performance improvement over the original software. Our experiments reveal that, for the communication subsystem of MXNet, the new design achieves a 16x speedup (up to 21x at peak) over 1 Gigabit Ethernet (1GigE). For two training cases on MXNet, the optimized implementation gains 5x and 9x speedups, respectively. Compared with experiments over the IP-over-InfiniBand (IPoIB) protocol, it achieves nearly 30% performance improvement, as well as better scalability and alleviation of bottlenecks.
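The abstract describes replacing MXNet's socket-based parameter-server transport with RDMA. The core verbs-level building block of any such design is registering parameter buffers with the NIC so that peers can access them through one-sided operations without involving the remote CPU. The sketch below is our minimal illustration of that step, assuming libibverbs and a 1 MiB stand-in buffer; it is not the authors' implementation.

```c
/*
 * Minimal libibverbs sketch: register a parameter buffer for RDMA.
 * Illustrative only; buffer size, device choice, and access flags
 * are assumptions. Build: gcc rdma_reg.c -libverbs
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num_devices = 0;
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices || num_devices == 0) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    /* Open the first device and allocate a protection domain. */
    struct ibv_context *ctx = ibv_open_device(devices[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* A stand-in for a tensor of model parameters (1 MiB, assumed size). */
    size_t len = 1 << 20;
    void *params = malloc(len);

    /* Pin and register the buffer so the NIC can read/write it directly,
       bypassing the kernel and extra copies (the core of RDMA's benefit). */
    struct ibv_mr *mr = ibv_reg_mr(pd, params, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr failed\n");
        return 1;
    }

    /* A peer needs this address and rkey (exchanged out of band during
       connection setup) to issue one-sided RDMA reads/writes here. */
    printf("buffer addr=%p rkey=0x%x lkey=0x%x\n", params, mr->rkey, mr->lkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devices);
    free(params);
    return 0;
}
```

In a design like the one the abstract outlines, each worker and server would register its key-value buffers this way once at startup, then push and pull parameters via RDMA writes and reads rather than per-message socket sends, which is where the reported speedups over 1GigE and IPoIB would come from.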



Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403).

Author information


Corresponding author

Correspondence to Mingfan Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Cite this article

Li, M., Wen, K., Lin, H. et al. Improving the Performance of Distributed MXNet with RDMA. Int J Parallel Prog 47, 467–480 (2019). https://doi.org/10.1007/s10766-018-00623-w

