DOI: 10.1145/3508396.3512869

Towards efficient vision transformer inference: a first study of transformers on mobile devices

Published: 09 March 2022

Abstract

Convolutional neural networks (CNNs) have long dominated the model choice for on-device intelligent mobile applications. Recently, vision transformers, notable for their use of the self-attention mechanism, have developed rapidly and demonstrated superior accuracy over CNNs. However, vision transformers carry expensive computation costs, and their inference efficiency on resource-constrained mobile devices remains unclear. This uncertainty makes it hard for on-device intelligence to benefit from vision transformers.
In this work, we carry out the first empirical study of the possibility of efficiently deploying vision transformers on mobile devices. Our twofold study (i) profiles representative vision transformers to understand their inference performance on commercial mobile devices and the reasons behind it; and (ii) studies multi-dimensional DNN acceleration approaches to achieve minimal latency. Results show that vision transformer inference on mobile devices is still too expensive: it runs 1.58x-41x slower than CNNs. By removing redundant attention heads and FFN layers, DeiT-Tiny saves 23.2% latency with a negligible 0.75% accuracy loss. Our study provides 7 insightful findings for future efficient vision transformer optimization and design.
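The cost gap described above can be made concrete with a back-of-envelope multiply-accumulate (MAC) count. The sketch below is illustrative only, not the paper's measurement methodology: it uses DeiT-Tiny's published configuration (12 encoder blocks, embedding dimension 192, a 224x224 input giving 196 patches plus one class token), and the pruning example in it (dropping two FFN sub-layers) uses hypothetical numbers, not the paper's exact pruning recipe.

```python
# Illustrative cost model for the DeiT-Tiny encoder (assumed config:
# 12 blocks, embedding dim 192, 196 patch tokens + 1 class token).

def block_macs(n_tokens: int, dim: int, ffn_ratio: int = 4) -> int:
    """Approximate MACs for one transformer encoder block."""
    qkv = 3 * n_tokens * dim * dim                  # Q, K, V projections
    attn = 2 * n_tokens * n_tokens * dim            # QK^T scores + attention-weighted V
    proj = n_tokens * dim * dim                     # attention output projection
    ffn = 2 * n_tokens * dim * (ffn_ratio * dim)    # two FFN linear layers
    return qkv + attn + proj + ffn

N_TOKENS, DIM, N_BLOCKS = 197, 192, 12
full = N_BLOCKS * block_macs(N_TOKENS, DIM)

# Hypothetical structured pruning: removing the FFN sub-layer from 2 of
# the 12 blocks eliminates that share of the encoder's MACs.
ffn_macs = 2 * N_TOKENS * DIM * (4 * DIM)
pruned = full - 2 * ffn_macs

print(f"DeiT-Tiny encoder: ~{full / 1e9:.2f} GMACs")
print(f"With 2 FFN sub-layers removed: ~{pruned / 1e9:.2f} GMACs "
      f"({100 * (1 - pruned / full):.1f}% fewer)")
```

Note that only the `attn` term grows quadratically with the token count, while every projection term grows quadratically with the embedding dimension; these dense, memory-bound matrix multiplications are part of why transformer inference maps poorly onto mobile processors compared with convolutions.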



      Published In

      HotMobile '22: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications
      March 2022
      137 pages
ISBN: 9781450392181
DOI: 10.1145/3508396

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. edge AI
      2. mobile inference
      3. vision transformer

      Qualifiers

      • Short-paper

Conference

HotMobile '22

Acceptance Rates

Overall acceptance rate: 96 of 345 submissions, 28%

Cited By

• (2025) A Hybrid Deep Learning Method for the Estimation of the State of Health of Lithium-Ion Batteries. International Transactions on Electrical Energy Systems, 2025(1). DOI: 10.1155/etep/2442893. Online: 16 Jan 2025.
• (2025) Resource-efficient Algorithms and Systems of Foundation Models: A Survey. ACM Computing Surveys, 57(5), 1-39. DOI: 10.1145/3706418. Online: 9 Jan 2025.
• (2025) Patch and Model Size Characterization for On-Device Efficient-ViTs on Small Datasets Using 12 Quantitative Metrics. IEEE Access, 13, 25704-25722. DOI: 10.1109/ACCESS.2025.3536471.
• (2025) ViTs as backbones: Leveraging vision transformers for feature extraction. Information Fusion, 102951. DOI: 10.1016/j.inffus.2025.102951. Online: Jan 2025.
• (2024) A Brief Review of Lightweighting Methods for Vision Transformers (ViT). International Journal of Computer Science and Information Technology, 4(2), 283-288. DOI: 10.62051/ijcsit.v4n2.37. Online: 10 Oct 2024.
• (2024) Low-Complexity Vision Transformer with Adaptive Channel Partitioning Method. Journal of Broadcast Engineering, 29(3), 291-305. DOI: 10.5909/JBE.2024.29.3.291. Online: 31 May 2024.
• (2024) LitePred. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 1463-1477. DOI: 10.5555/3691825.3691906. Online: 16 Apr 2024.
• (2024) On-Edge Deployment of Vision Transformers for Medical Diagnostics Using the Kvasir-Capsule Dataset. Applied Sciences, 14(18), 8115. DOI: 10.3390/app14188115. Online: 10 Sep 2024.
• (2024) Lightweight Deep Learning for Resource-Constrained Environments: A Survey. ACM Computing Surveys, 56(10), 1-42. DOI: 10.1145/3657282. Online: 24 Jun 2024.
• (2024) Real-Time Polyp Detection in Colonoscopy using Lightweight Transformer. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 7794-7804. DOI: 10.1109/WACV57701.2024.00763. Online: 3 Jan 2024.
