DOI: 10.1145/3508396.3512869
short-paper

Towards efficient vision transformer inference: a first study of transformers on mobile devices

Published: 09 March 2022

ABSTRACT

Convolutional neural networks (CNNs) have long dominated the model choice for on-device intelligent mobile applications. Recently, we have witnessed the fast development of vision transformers, which are notable for their use of the self-attention mechanism and have demonstrated superior accuracy over CNNs. However, vision transformers come with expensive computation costs, and their inference efficiency on resource-constrained mobile devices is still unclear. This brings much uncertainty as to whether on-device intelligence can benefit from vision transformers.

In this work, we carry out the first empirical study to investigate the possibility of efficiently deploying vision transformers on mobile devices. Our twofold study (i) profiles representative vision transformers to understand their inference performance on commercial mobile devices and the underlying reasons; and (ii) studies multi-dimensional DNN acceleration approaches to achieve minimal latency. Results show that vision transformer inference is prohibitively expensive on mobile devices: it is 1.58x-41x slower than CNNs. By removing redundant attention heads and FFN layers, DeiT-Tiny saves 23.2% latency with a negligible 0.75% accuracy loss. Our study provides 7 insightful findings for future efficient vision transformer optimization and design.
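To make the head-pruning result concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a multi-head self-attention module whose heads can be zeroed out; the `PrunableAttention` class name, the `head_mask` buffer, and the masking mechanism are illustrative assumptions of ours.

```python
import torch
import torch.nn as nn

class PrunableAttention(nn.Module):
    """Multi-head self-attention with a per-head keep/drop mask.

    A toy stand-in for the attention-head pruning studied in the paper;
    the name and the `head_mask` buffer are illustrative assumptions.
    """
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # 1.0 = keep this head, 0.0 = "prune" (zero out) this head
        self.register_buffer("head_mask", torch.ones(num_heads))

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = attn.softmax(dim=-1) @ v                # (B, heads, N, head_dim)
        out = out * self.head_mask.view(1, -1, 1, 1)  # zero the pruned heads
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# DeiT-Tiny uses embedding dim 192, 3 heads, and 197 tokens (196 patches + class token).
attn = PrunableAttention(dim=192, num_heads=3)
attn.head_mask[2] = 0.0                               # emulate pruning the third head
y = attn(torch.randn(1, 197, 192))                    # -> shape (1, 197, 192)
```

Note that masking only emulates the accuracy effect of pruning; to realize the latency savings reported above, the pruned heads (and FFN layers) must be physically removed so that the qkv and projection weights shrink accordingly.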


Published in

        HotMobile '22: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications
        March 2022
        137 pages
ISBN: 9781450392181
DOI: 10.1145/3508396

        Copyright © 2022 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

Overall Acceptance Rate: 96 of 345 submissions, 28%
