Abstract
With the rapidly growing demand for high-performance deep learning vision models on mobile and edge devices, compact vision models that deliver high accuracy while maintaining a small model size have become increasingly important. Building on the success of transformer models in natural language processing and computer vision, this paper offers a comprehensive examination of recent research on redesigning the Vision Transformer (ViT) into compact architectures suitable for mobile/edge devices. It classifies compact ViT models into three major categories: (1) architecture and hierarchy restructuring, (2) encoder block enhancements, and (3) integrated approaches, and provides a detailed overview of each category. It also analyzes how each method contributes to model accuracy and computational efficiency, providing a deeper understanding of how to implement ViT models efficiently on edge devices. As a result, this paper offers researchers new insights into the design and implementation of compact ViT models and provides guidelines for optimizing the performance and efficiency of deep learning vision models on edge devices.
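To make the taxonomy concrete, the sketch below shows a standard ViT encoder block (pre-norm multi-head self-attention followed by a token-wise MLP), the basic unit that the "encoder block enhancements" category modifies and that "architecture and hierarchy restructuring" methods rearrange or replace. This is a minimal illustrative example assuming PyTorch; the dimensions and layer sizes are arbitrary and do not correspond to any specific model discussed in the survey.

```python
# Minimal sketch of a standard (non-compact) ViT encoder block.
# Assumes PyTorch; sizes are illustrative, not taken from any surveyed model.
import torch
import torch.nn as nn

class ViTEncoderBlock(nn.Module):
    def __init__(self, dim=192, num_heads=3, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):  # x: (batch, num_tokens, dim)
        # Global self-attention over all patch tokens; compact ViTs typically
        # restrict, window, or otherwise restructure this interaction.
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)
        x = x + attn_out
        # Token-wise MLP; another common target for slimming.
        x = x + self.mlp(self.norm2(x))
        return x

tokens = torch.randn(1, 197, 192)  # 196 patch tokens + 1 class token
out = ViTEncoderBlock()(tokens)
print(out.shape)                   # torch.Size([1, 197, 192])
```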






Data availability
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Acknowledgements
This work was partly supported by the National R&D Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2022M3I7A1078936) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2019R1A6A1A03032119).
Author information
Authors and Affiliations
Contributions
S. Lee and K. Koo collaborated on the writing and review of the paper. They jointly authored the figures and text for the sections titled ‘Introduction’, ‘Background’, and ‘Performance Analysis of Various ViT Models’. All authors collaborated on the investigation of the compact ViT models. S. O was primarily responsible for writing the section titled ‘Architecture and Hierarchy Restructuring’. J. Lee led the writing of the section on ‘Encoder Block Enhancements’. S.J. authored the text for the ‘Integrated Approaches’ section of the paper. G. Lee compiled and summarized the experimental results for the paper and was also responsible for creating and formatting all tables. H. Kim supervised and reviewed the entire process. All authors reviewed, edited, and approved the final manuscript. This collective effort ensured a comprehensive and well-rounded analysis of the compact ViT models, contributing to the overall quality and depth of the review paper.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by J. Gao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lee, S.I., Koo, K., Lee, J.H. et al. Vision transformer models for mobile/edge devices: a survey. Multimedia Systems 30, 109 (2024). https://doi.org/10.1007/s00530-024-01312-0