Abstract
Deep learning has achieved tremendous success in many fields, but training deep neural networks (DNNs) is highly compute-intensive. This has spawned numerous deep learning frameworks that aim to offer better usability and higher performance to practitioners. TensorFlow and PyTorch are the two most popular: TensorFlow is more prevalent in industry, while PyTorch is more appealing in academia. The two frameworks differ substantially owing to their opposite design philosophies: static versus dynamic computation graphs. TensorFlow is generally regarded as more performance-friendly, since a full view of the computation graph gives it more opportunities for optimization. However, there are also claims that PyTorch is sometimes faster than TensorFlow, which leaves end-users confused about which framework to choose. In this paper, we carry out an analytical and experimental analysis to unravel the mystery of the single-GPU training-speed comparison between TensorFlow and PyTorch. To make our investigation as comprehensive as possible, we carefully select seven popular neural networks covering computer vision, speech recognition, and natural language processing (NLP). The contributions of this work are two-fold. First, we conduct detailed benchmarking experiments on TensorFlow and PyTorch and analyze the reasons for their performance differences, providing guidance for end-users choosing between the two frameworks. Second, we identify key factors that affect performance, which can direct end-users to write their models more efficiently.
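The static-versus-dynamic distinction at the heart of this comparison can be illustrated with a minimal, framework-agnostic sketch in plain Python. This is illustrative code only, not the real TensorFlow or PyTorch API: the `Node` class stands in for a static graph builder, and the plain function stands in for eager execution.

```python
# Static graph (TensorFlow 1.x style): operations are first recorded into a
# graph data structure, so a runtime sees the whole computation up front and
# can optimize it (e.g., fuse or reorder kernels) before any value exists.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def run(self, feed):
        # Evaluate the recorded graph only when concrete data is fed in.
        if self.op == "input":
            return feed[self.inputs[0]]
        vals = [n.run(feed) for n in self.inputs]
        if self.op == "add":
            return vals[0] + vals[1]
        if self.op == "mul":
            return vals[0] * vals[1]
        raise ValueError(f"unknown op: {self.op}")

x = Node("input", "x")
y = Node("mul", Node("add", x, x), x)   # records y = (x + x) * x; nothing computed yet
result_static = y.run({"x": 3})         # graph executed with concrete data

# Dynamic graph (PyTorch style): each operation executes immediately, so the
# "graph" exists only implicitly in ordinary Python control flow, which is
# easier to debug but hides the full computation from the runtime.
def dynamic(x):
    t = x + x       # computed right away
    return t * x

result_dynamic = dynamic(3)

print(result_static, result_dynamic)    # both compute (x + x) * x = 18 for x = 3
```

The trade-off the abstract describes follows directly from this sketch: the static builder can inspect and rewrite the whole graph before running it, while the eager version pays no graph-construction cost but offers the runtime nothing to optimize across operations.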
Acknowledgements
This work was partly supported by National Key R&D Program of China (Grant No. 2017YFC0803700) and National Natural Science Foundation of China (Grant No. 61772218). We thank Fan YANG, Ying CAO and Xiaosong MA for their valuable comments.
Dai, H., Peng, X., Shi, X. et al. Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment. Sci. China Inf. Sci. 65, 112103 (2022). https://doi.org/10.1007/s11432-020-3182-1