DOI: 10.1145/3508396.3512869
short-paper

Towards efficient vision transformer inference: a first study of transformers on mobile devices

Published: 09 March 2022

ABSTRACT

Convolutional neural networks (CNNs) have long dominated the model choice for on-device intelligent mobile applications. Recently, we have witnessed the fast development of vision transformers, which are notable for their use of the self-attention mechanism and have demonstrated superior accuracy over CNNs. However, vision transformers come with expensive computation costs, and their inference efficiency on resource-constrained mobile devices is still unclear. This brings much uncertainty as to whether on-device intelligence can benefit from vision transformers.

In this work, we carry out the first empirical study to investigate the possibility of efficiently deploying vision transformers on mobile devices. Our twofold study (i) profiles representative vision transformers to understand their inference performance on commercial mobile devices and the underlying reasons; and (ii) studies multi-dimensional DNN acceleration approaches to achieve minimal latency. Results show that vision transformer inference is prohibitively expensive on mobile devices: it is 1.58x-41x slower than CNNs. By removing redundant attention heads and FFN layers, DeiT-Tiny saves 23.2% latency with a negligible 0.75% accuracy loss. Our study provides 7 insightful findings for future efficient vision transformer optimization and design.
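To make the head-pruning result concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a multi-head self-attention module whose heads can be zeroed out; the `PrunableAttention` class name, the `head_mask` buffer, and the masking mechanism are illustrative assumptions of ours.

```python
import torch
import torch.nn as nn

class PrunableAttention(nn.Module):
    """Multi-head self-attention with a per-head keep/drop mask.

    A toy stand-in for the attention-head pruning studied in the paper;
    the name and the `head_mask` buffer are illustrative assumptions.
    """
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # 1.0 = keep this head, 0.0 = "prune" (zero out) this head
        self.register_buffer("head_mask", torch.ones(num_heads))

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = attn.softmax(dim=-1) @ v                # (B, heads, N, head_dim)
        out = out * self.head_mask.view(1, -1, 1, 1)  # zero the pruned heads
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# DeiT-Tiny uses embedding dim 192, 3 heads, and 197 tokens (196 patches + class token).
attn = PrunableAttention(dim=192, num_heads=3)
attn.head_mask[2] = 0.0                               # emulate pruning the third head
y = attn(torch.randn(1, 197, 192))                    # -> shape (1, 197, 192)
```

Note that masking only emulates the accuracy effect of pruning; to realize the latency savings reported above, the pruned heads (and FFN layers) must be physically removed so that the qkv and projection weights shrink accordingly.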


Published in

        HotMobile '22: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications
        March 2022
        137 pages
ISBN: 9781450392181
DOI: 10.1145/3508396

        Copyright © 2022 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

Overall Acceptance Rate: 96 of 345 submissions, 28%
