
Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Deep learning has achieved tremendous success in various fields, but training deep neural networks (DNNs) is very compute-intensive. This has given rise to numerous deep learning frameworks that aim to offer better usability and higher performance to deep learning practitioners. TensorFlow and PyTorch are the two most popular frameworks: TensorFlow is more widely adopted in industry, while PyTorch is more appealing in academia. The two frameworks differ substantially because of their opposite design philosophies: static versus dynamic computation graphs. TensorFlow is generally regarded as more performance-friendly, since a full view of the computation graph gives it more opportunities for optimization. However, there are also claims that PyTorch is sometimes faster than TensorFlow, which leaves end-users confused about which framework to choose. In this paper, we carry out an analytical and experimental analysis to unravel this mystery of single-GPU training speed between TensorFlow and PyTorch. To make our investigation as comprehensive as possible, we carefully select seven popular neural networks that cover computer vision, speech recognition, and natural language processing (NLP). The contributions of this work are two-fold. First, we conduct detailed benchmarking experiments on TensorFlow and PyTorch and analyze the reasons for their performance differences; this provides guidance for end-users choosing between the two frameworks. Second, we identify several key factors that affect performance, which can help end-users write their models more efficiently.
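To make the static-versus-dynamic distinction concrete, the sketch below contrasts PyTorch's default eager execution (operators run immediately, so the graph is effectively rebuilt every step) with TensorFlow's graph execution via tf.function (the function is traced once into a static graph that the runtime can optimize before repeated execution). This is a minimal illustration of the design difference the abstract refers to, not the paper's benchmark code; the toy computation and tensor shapes are our own assumptions.

```python
# Minimal sketch of eager (PyTorch) vs. graph (TensorFlow) execution.
# The toy matmul+ReLU workload and shapes are illustrative only.

import tensorflow as tf  # graph execution via tf.function
import torch             # eager (dynamic-graph) execution by default

# PyTorch: each call dispatches operators immediately, one by one.
def torch_step(x, w):
    return torch.relu(x @ w).sum()

x_t = torch.randn(32, 128)
w_t = torch.randn(128, 64)
print(torch_step(x_t, w_t))

# TensorFlow: tf.function traces the Python function into a static graph,
# giving the runtime a full view of the computation for optimization.
@tf.function
def tf_step(x, w):
    return tf.reduce_sum(tf.nn.relu(tf.matmul(x, w)))

x_f = tf.random.normal([32, 128])
w_f = tf.random.normal([128, 64])
print(tf_step(x_f, w_f))
```

The trade-off in this sketch mirrors the abstract's claim: the traced tf.function can be optimized ahead of repeated execution, while the eager PyTorch version pays per-operator dispatch cost but keeps ordinary Python control flow and debugging.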



Acknowledgements

This work was partly supported by the National Key R&D Program of China (Grant No. 2017YFC0803700) and the National Natural Science Foundation of China (Grant No. 61772218). We thank Fan YANG, Ying CAO, and Xiaosong MA for their valuable comments.

Author information

Corresponding author

Correspondence to Xuanhua Shi.


About this article

Cite this article

Dai, H., Peng, X., Shi, X. et al. Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment. Sci. China Inf. Sci. 65, 112103 (2022). https://doi.org/10.1007/s11432-020-3182-1
