DOI: 10.1145/3582649.3582676

Efficient-ViT: A Light-Weight Classification Model Based on CNN and ViT

Published: 07 April 2023

Abstract

The Vision Transformer (ViT) model suffers from a large number of parameters, a weak inductive bias, and sensitivity to data augmentation. Inspired by MobileViT, we propose Efficient-ViT, a light-weight classification model that combines Convolutional Neural Networks (CNNs) and the Vision Transformer (ViT). By introducing a Squeeze-and-Excitation Block (SE-Block), Overlapping Patch Embedding (OPE), and Linear Spatial Reduction Attention (Linear SRA), the model effectively encodes and integrates the local and global information of the input feature map while remaining compact. Local information is processed by the CNN branch and global information by the ViT branch, and the captured information is then fused. The proposed model thus combines the inductive bias of CNNs with the global modeling capability of ViT, and can learn better feature representations. Classification experiments were carried out on three datasets: CIFAR10, CIFAR100, and Stanford Cars. The experimental results show that the proposed method achieves better results, improving Top-1 accuracy by 3.32% (from 86.55% to 89.87%).
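The paper's architecture is not reproduced on this page, but the Squeeze-and-Excitation mechanism the abstract names can be sketched in plain Python. This is a minimal illustration of the channel-reweighting idea only; the weight shapes, reduction ratio, and helper names here are illustrative assumptions, not the paper's actual configuration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation sketch (illustrative, not the paper's exact layer).

    feature_map: list of C channels, each an HxW list of lists.
    w1: bottleneck weights, one row of length C per hidden unit.
    w2: expansion weights, one row of length len(w1) per output channel.
    """
    # Squeeze: global average pooling per channel -> vector of length C.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: C -> C/r bottleneck with ReLU, then back to C with sigmoid.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    scale = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Reweight: multiply every value in each channel by its learned scale.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(feature_map, scale)]
```

With all-zero weights the sigmoid gate is 0.5 for every channel, so each channel is simply halved; trained weights would instead learn per-channel importance from the pooled statistics.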

References

[1]
Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30: 5998–6008.
[2]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint, 2020, arXiv:2010.11929.
[3]
Hugo Touvron, Matthieu Cord, Matthijs Douze, et al. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 2021: 10347–10357.
[4]
Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, et al. LeViT: A vision transformer in ConvNet's clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021: 12259–12269.
[5]
Haiping Wu, Bin Xiao, Noel Codella, et al. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 22–31.
[6]
Andrew Howard, Mark Sandler, Grace Chu, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019: 1314–1324.
[7]
Tete Xiao, Mannat Singh, Eric Mintun, et al. Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 2021, 34: 30392–30400.
[8]
Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, et al. ConViT: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, 2021: 2286–2296.
[9]
Yinpeng Chen, Xiyang Dai, Dongdong Chen, et al. Mobile-Former: Bridging MobileNet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 5270–5279.
[10]
Sachin Mehta, Mohammad Rastegari. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint, 2021, arXiv:2110.02178.
[11]
Mingxing Tan, Quoc Le. EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning, 2021: 10096–10106.
[12]
Wenhai Wang, Enze Xie, Xiang Li, et al. PVT v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 2022, 8(3): 415–424.
[13]
Kun Yuan, Shaopeng Guo, Ziwei Liu, et al. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 579–588.
[14]
Zihang Dai, Hanxiao Liu, Quoc V. Le, et al. CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 2021, 34: 3965–3977.
[15]
Mark Sandler, Andrew Howard, Menglong Zhu, et al. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4510–4520.
[16]
Ching-Hsun Tseng, Shin-Jye Lee, Jia-Nan Feng, et al. UPANets: Learning from the universal pixel attention networks. arXiv preprint, 2021, arXiv:2103.08640.
[17]
Ali Hassani, Steven Walton, Nikhil Shah, et al. Escaping the big data paradigm with compact transformers. arXiv preprint, 2021, arXiv:2104.05704.
[18]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770–778.
[19]
Zhengyang Shen, Lingshen He, Zhouchen Lin, et al. PDO-eConvs: Partial differential operator based equivariant convolutions. In International Conference on Machine Learning, 2020: 8697–8706.
[20]
Yaowei Zheng, Richong Zhang, Yongyi Mao. Regularizing neural networks via adversarial model perturbation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 8156–8165.
[21]
Seyyed Hossein Hasanpour, Mohammad Rouhani, et al. Let's keep it simple: Using simple architectures to outperform deeper and more complex architectures. arXiv preprint, 2016, arXiv:1608.06037.
[22]
Hugo Touvron, Piotr Bojanowski, Mathilde Caron, et al. ResMLP: Feedforward networks for image classification with data-efficient training. arXiv preprint, 2021, arXiv:2105.03404.


    Published In

    ICIGP '23: Proceedings of the 2023 6th International Conference on Image and Graphics Processing
    January 2023
    246 pages
    ISBN:9781450398572
    DOI:10.1145/3582649
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Convolutional Neural Network
    2. Image Classification
    3. Model Fusion
    4. Vision Transformer

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICIGP 2023

Article Metrics

    • Downloads (Last 12 months)71
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 22 Jan 2025

Cited By
    • (2025)Application of Convolutional Neural Networks and Recurrent Neural Networks in Food SafetyFoods10.3390/foods1402024714:2(247)Online publication date: 14-Jan-2025
    • (2024)Heterogeneous Prototype Distillation With Support-Query Correlative Guidance for Few-Shot Remote Sensing Scene ClassificationIEEE Transactions on Geoscience and Remote Sensing10.1109/TGRS.2024.340963762(1-18)Online publication date: 2024
    • (2024)An Image-Text Sentiment Analysis Method Using Multi-Channel Multi-Modal Joint LearningApplied Artificial Intelligence10.1080/08839514.2024.237171238:1Online publication date: 28-Jun-2024
    • (2024)IDP-Net: Industrial defect perception network based on cross-layer semantic information guidance and context concentration enhancementEngineering Applications of Artificial Intelligence10.1016/j.engappai.2023.107677130(107677)Online publication date: Apr-2024
    • (2024)ICA-NetEngineering Applications of Artificial Intelligence10.1016/j.engappai.2023.107134126:PDOnline publication date: 27-Feb-2024
    • (2024)PQ-SAM: Post-training Quantization for Segment Anything ModelComputer Vision – ECCV 202410.1007/978-3-031-72684-2_24(420-437)Online publication date: 3-Nov-2024
