Abstract
With the rapidly growing demand for high-performance deep learning vision models on mobile and edge devices, compact vision models that deliver high accuracy while maintaining a small model size have become increasingly important. Building on the success of transformer models in natural language processing and computer vision, this paper offers a comprehensive examination of recent research on redesigning the Vision Transformer (ViT) into compact architectures suitable for mobile/edge devices. It classifies compact ViT models into three major categories: (1) architecture and hierarchy restructuring, (2) encoder block enhancements, and (3) integrated approaches, and provides a detailed overview of each category. It also analyzes how each method contributes to model accuracy and computational efficiency, providing a deeper understanding of how to implement ViT models efficiently on edge devices. As a result, this paper offers researchers new insights into the design and implementation of compact ViT models and provides guidelines for optimizing the performance and efficiency of deep learning vision models on edge devices.
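To make the taxonomy concrete, the sketch below shows a standard ViT encoder block (pre-norm multi-head self-attention followed by a token-wise MLP), the basic unit that the "encoder block enhancements" category modifies and that "architecture and hierarchy restructuring" methods rearrange or replace. This is a minimal illustrative example assuming PyTorch; the dimensions and layer sizes are arbitrary and do not correspond to any specific model discussed in the survey.

```python
# Minimal sketch of a standard (non-compact) ViT encoder block.
# Assumes PyTorch; sizes are illustrative, not taken from any surveyed model.
import torch
import torch.nn as nn

class ViTEncoderBlock(nn.Module):
    def __init__(self, dim=192, num_heads=3, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):  # x: (batch, num_tokens, dim)
        # Global self-attention over all patch tokens; compact ViTs typically
        # restrict, window, or otherwise restructure this interaction.
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)
        x = x + attn_out
        # Token-wise MLP; another common target for slimming.
        x = x + self.mlp(self.norm2(x))
        return x

tokens = torch.randn(1, 197, 192)  # 196 patch tokens + 1 class token
out = ViTEncoderBlock()(tokens)
print(out.shape)                   # torch.Size([1, 197, 192])
```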






Data availability
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Acknowledgements
This work was partly supported by the National R&D Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2022M3I7A1078936) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2019R1A6A1A03032119).
Author information
Authors and Affiliations
Contributions
S. Lee and K. Koo collaborated on the writing and review of the paper. They jointly authored the figures and text for the sections titled ‘Introduction’, ‘Background’, and ‘Performance Analysis of Various ViT Models’. All authors collaborated on the investigation of the compact ViT models. S. O was primarily responsible for writing the section titled ‘Architecture and Hierarchy Restructuring’. J. Lee led the writing of the section on ‘Encoder Block Enhancements’. S.J. authored the text for the ‘Integrated Approaches’ section of the paper. G. Lee compiled and summarized the experimental results for the paper and was also responsible for creating and formatting all tables. H. Kim supervised and reviewed the entire process. All authors reviewed, edited, and approved the final manuscript. This collective effort ensured a comprehensive and well-rounded analysis of the compact ViT models, contributing to the overall quality and depth of the review paper.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by J. Gao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lee, S.I., Koo, K., Lee, J.H. et al. Vision transformer models for mobile/edge devices: a survey. Multimedia Systems 30, 109 (2024). https://doi.org/10.1007/s00530-024-01312-0