ABSTRACT
Convolutional neural networks (CNNs) have long dominated model choice for on-device intelligent mobile applications. Recently, we are witnessing the fast development of vision transformers, which are notable for their use of the self-attention mechanism and have demonstrated superior accuracy over CNNs. However, vision transformers come with expensive computation costs, and their inference efficiency on resource-constrained mobile devices is still unclear. This brings much uncertainty about whether on-device intelligence can benefit from vision transformers.
In this work, we carry out the first empirical study to investigate the possibility of efficiently deploying vision transformers on mobile devices. Our twofold study (i) profiles representative vision transformers to understand their inference performance on commercial mobile devices and the reasons behind it; and (ii) studies multi-dimensional DNN acceleration approaches to achieve minimal latency. Results show that vision transformer inference on mobile devices is prohibitively expensive: it is 1.58x-41x slower than CNNs. By removing redundant attention heads and FFN layers, DeiT-Tiny saves 23.2% latency with a negligible 0.75% accuracy loss. Our study provides 7 insightful findings for future efficient vision transformer optimization and design.
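The head-removal idea mentioned above can be illustrated with a minimal sketch. This is a toy NumPy multi-head self-attention, not the paper's implementation: each head keeps its own Q/K/V projection slice, so pruning a head simply drops those slices and the matching rows of the output projection, shrinking both parameters and compute.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MultiHeadSelfAttention:
    """Toy multi-head self-attention with per-head weight slices."""
    def __init__(self, dim, num_heads, rng):
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # One (dim, head_dim) projection per head for Q, K, and V.
        self.wq = [rng.standard_normal((dim, self.head_dim)) for _ in range(num_heads)]
        self.wk = [rng.standard_normal((dim, self.head_dim)) for _ in range(num_heads)]
        self.wv = [rng.standard_normal((dim, self.head_dim)) for _ in range(num_heads)]
        # Output projection: (num_heads * head_dim, dim).
        self.wo = rng.standard_normal((num_heads * self.head_dim, dim))

    def prune_heads(self, heads_to_remove):
        """Remove whole heads, shrinking the model rather than masking it."""
        keep = [h for h in range(self.num_heads) if h not in heads_to_remove]
        self.wq = [self.wq[h] for h in keep]
        self.wk = [self.wk[h] for h in keep]
        self.wv = [self.wv[h] for h in keep]
        # Keep only the output-projection rows fed by the surviving heads.
        rows = np.concatenate(
            [np.arange(h * self.head_dim, (h + 1) * self.head_dim) for h in keep])
        self.wo = self.wo[rows, :]
        self.num_heads = len(keep)

    def __call__(self, x):  # x: (tokens, dim)
        outs = []
        for wq, wk, wv in zip(self.wq, self.wk, self.wv):
            q, k, v = x @ wq, x @ wk, x @ wv
            attn = softmax(q @ k.T / np.sqrt(self.head_dim))
            outs.append(attn @ v)
        return np.concatenate(outs, axis=-1) @ self.wo

def num_params(m):
    return sum(w.size for w in m.wq + m.wk + m.wv) + m.wo.size
```

Because pruned heads are physically removed instead of zero-masked, the latency saving is realized on dense kernels without any sparse-runtime support, which is the property that matters on mobile inference frameworks.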
Towards efficient vision transformer inference: a first study of transformers on mobile devices