DOI: 10.1145/3582649.3582676

Efficient-ViT: A Light-Weight Classification Model Based on CNN and ViT

Published: 07 April 2023

Abstract

The Vision Transformer (ViT) model suffers from a large number of parameters, a weak inductive bias, and sensitivity to data augmentation. Inspired by MobileViT, we propose Efficient-ViT, a light-weight classification model that combines Convolutional Neural Networks (CNNs) and the Vision Transformer (ViT). By introducing a Squeeze-and-Excitation Block (SE-Block), Overlapping Patch Embedding (OPE), and Linear Spatial Reduction Attention (Linear SRA), the model effectively encodes and integrates the local and global information of the input feature map while remaining compact. Local information is processed by the CNN branch and global information by the ViT branch, and the captured information is then fused. The proposed model thus combines the inductive bias of CNNs with the global modeling capability of ViT, and can learn better feature representations. Classification experiments were carried out on three datasets: CIFAR10, CIFAR100, and Stanford Cars. The experimental results show that the proposed method achieves better results, improving Top-1 accuracy by 3.32% (from 86.55% to 89.87%).
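The paper's architecture is not reproduced on this page, but the Squeeze-and-Excitation mechanism the abstract names can be sketched in plain Python. This is a minimal illustration of the channel-reweighting idea only; the weight shapes, reduction ratio, and helper names here are illustrative assumptions, not the paper's actual configuration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation sketch (illustrative, not the paper's exact layer).

    feature_map: list of C channels, each an HxW list of lists.
    w1: bottleneck weights, one row of length C per hidden unit.
    w2: expansion weights, one row of length len(w1) per output channel.
    """
    # Squeeze: global average pooling per channel -> vector of length C.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: C -> C/r bottleneck with ReLU, then back to C with sigmoid.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    scale = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Reweight: multiply every value in each channel by its learned scale.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(feature_map, scale)]
```

With all-zero weights the sigmoid gate is 0.5 for every channel, so each channel is simply halved; trained weights would instead learn per-channel importance from the pooled statistics.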

References

[1]
Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30: 5998–6008.
[2]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint, 2020, arXiv:2010.11929.
[3]
Hugo Touvron, Matthieu Cord, Matthijs Douze, et al. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 2021: 10347–10357.
[4]
Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, et al. LeViT: A vision transformer in ConvNet's clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021: 12259–12269.
[5]
Haiping Wu, Bin Xiao, Noel Codella, et al. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 22–31.
[6]
Andrew Howard, Mark Sandler, Grace Chu, et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019: 1314–1324.
[7]
Tete Xiao, Mannat Singh, Eric Mintun, et al. Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 2021, 34: 30392–30400.
[8]
Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, et al. ConViT: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, 2021: 2286–2296.
[9]
Yinpeng Chen, Xiyang Dai, Dongdong Chen, et al. Mobile-Former: Bridging MobileNet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 5270–5279.
[10]
Sachin Mehta, Mohammad Rastegari. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint, 2021, arXiv:2110.02178.
[11]
Mingxing Tan, Quoc Le. EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning, 2021: 10096–10106.
[12]
Wenhai Wang, Enze Xie, Xiang Li, et al. PVT v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 2022, 8(3): 415–424.
[13]
Kun Yuan, Shaopeng Guo, Ziwei Liu, et al. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 579–588.
[14]
Zihang Dai, Hanxiao Liu, Quoc V. Le, et al. CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 2021, 34: 3965–3977.
[15]
Mark Sandler, Andrew Howard, Menglong Zhu, et al. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4510–4520.
[16]
Ching-Hsun Tseng, Shin-Jye Lee, Jia-Nan Feng, et al. UPANets: Learning from the universal pixel attention networks. arXiv preprint, 2021, arXiv:2103.08640.
[17]
Ali Hassani, Steven Walton, Nikhil Shah, et al. Escaping the big data paradigm with compact transformers. arXiv preprint, 2021, arXiv:2104.05704.
[18]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770–778.
[19]
Zhengyang Shen, Lingshen He, Zhouchen Lin, et al. PDO-eConvs: Partial differential operator based equivariant convolutions. In International Conference on Machine Learning, 2020: 8697–8706.
[20]
Yaowei Zheng, Richong Zhang, Yongyi Mao. Regularizing neural networks via adversarial model perturbation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 8156–8165.
[21]
Seyyed Hossein Hasanpour, Mohammad Rouhani, et al. Let's keep it simple: Using simple architectures to outperform deeper and more complex architectures. arXiv preprint, 2016, arXiv:1608.06037.
[22]
Hugo Touvron, Piotr Bojanowski, Mathilde Caron, et al. ResMLP: Feedforward networks for image classification with data-efficient training. arXiv preprint, 2021, arXiv:2105.03404.


    Published In

    ICIGP '23: Proceedings of the 2023 6th International Conference on Image and Graphics Processing
    January 2023
    246 pages
    ISBN:9781450398572
    DOI:10.1145/3582649
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Convolutional Neural Network
    2. Image Classification
    3. Model Fusion
    4. Vision Transformer

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICIGP 2023

Article Metrics

    • Downloads (Last 12 months)71
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 22 Jan 2025

Cited By
    • (2025)Application of Convolutional Neural Networks and Recurrent Neural Networks in Food SafetyFoods10.3390/foods1402024714:2(247)Online publication date: 14-Jan-2025
    • (2024)Heterogeneous Prototype Distillation With Support-Query Correlative Guidance for Few-Shot Remote Sensing Scene ClassificationIEEE Transactions on Geoscience and Remote Sensing10.1109/TGRS.2024.340963762(1-18)Online publication date: 2024
    • (2024)An Image-Text Sentiment Analysis Method Using Multi-Channel Multi-Modal Joint LearningApplied Artificial Intelligence10.1080/08839514.2024.237171238:1Online publication date: 28-Jun-2024
    • (2024)IDP-Net: Industrial defect perception network based on cross-layer semantic information guidance and context concentration enhancementEngineering Applications of Artificial Intelligence10.1016/j.engappai.2023.107677130(107677)Online publication date: Apr-2024
    • (2024)ICA-NetEngineering Applications of Artificial Intelligence10.1016/j.engappai.2023.107134126:PDOnline publication date: 27-Feb-2024
    • (2024)PQ-SAM: Post-training Quantization for Segment Anything ModelComputer Vision – ECCV 202410.1007/978-3-031-72684-2_24(420-437)Online publication date: 3-Nov-2024
