Abstract
Remote sensing image scene classification methods based on convolutional neural networks (CNNs) have been extremely successful. However, the inherent limitations of CNNs make it difficult for them to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies to acquire global information, but it is computationally intensive. In addition, each scene class in remote sensing images contains a large number of similar background or foreground features. To effectively leverage these similar features and reduce computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining CNN and Transformer, and consists of a Convolution and Attention Block (CAB) and a Convolution and Token Merging Block (CTMB). Specifically, in the CAB module, the embedding layer of the original Vision Transformer is replaced with a modified MBConv (MBConv\(^*\)), and Fast Multi-Head Self-Attention (F-MHSA) reduces the quadratic complexity of the self-attention mechanism to linear. To further decrease the model's computational cost, CTMB employs adaptive token merging (ATOME) to fuse related foreground or background features. Experimental results on the UCM, AID and NWPU datasets show that the proposed model outperforms state-of-the-art remote sensing scene classification methods in both accuracy and efficiency. On the most challenging NWPU dataset, HELViT achieves the highest accuracies of 94.64% and 96.84% with 4.6 GMACs for 10% and 20% training samples, respectively.
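The key efficiency idea above, replacing quadratic self-attention with a linear-complexity variant, can be illustrated with a Fastformer-style additive attention sketch. This is not HELViT's exact F-MHSA (its formulation is not given in this excerpt); the function and parameter names below are assumptions chosen to show how per-token softmax scoring plus global query/key vectors yields O(N·d) cost instead of O(N²·d):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    t = sum(e)
    return [x / t for x in e]

def additive_attention(q, k, v, wa, wb):
    """Additive (Fastformer-style) attention, linear in sequence length.

    Illustrative sketch only: q, k, v are lists of N token vectors
    (each of length d), already projected; wa and wb are length-d
    scoring vectors. All names here are hypothetical.
    """
    n, d = len(q), len(q[0])
    scale = math.sqrt(d)
    # 1. Summarize all queries into one global query vector: O(N*d).
    alpha = softmax([sum(wa[j] * q[i][j] for j in range(d)) / scale
                     for i in range(n)])
    q_glob = [sum(alpha[i] * q[i][j] for i in range(n)) for j in range(d)]
    # 2. Mix the global query into each key elementwise, then summarize
    #    the mixed keys into one global key vector: again O(N*d).
    p = [[q_glob[j] * k[i][j] for j in range(d)] for i in range(n)]
    beta = softmax([sum(wb[j] * p[i][j] for j in range(d)) / scale
                    for i in range(n)])
    k_glob = [sum(beta[i] * p[i][j] for i in range(n)) for j in range(d)]
    # 3. Modulate each value by the global key; add a query residual.
    return [[k_glob[j] * v[i][j] + q[i][j] for j in range(d)]
            for i in range(n)]
```

Because every step reduces over tokens with a single weighted sum rather than forming an N×N attention matrix, total work grows linearly with the number of tokens, which is the property the abstract attributes to F-MHSA.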
Data Availability Statement
The datasets used during the study are available at http://weegee.vision.ucmerced.edu/datasets/landuse.html, https://captain-whu.github.io/AID/ and https://gcheng-nwpu.github.io/.
Funding
This work was supported by the National Natural Science Foundation of China (No. 41971365, 41571401, 62102200), the Science and Technology Research Project of Henan Province (212102210492, 232102211058, 232102110299), the Key Research Projects of Henan Higher Education Institutions (No. 23A520053, 23B520030), the Doctoral Research Start-up Fund Project at Nanyang Institute of Technology, the General Project of Humanities and Social Sciences Research in Henan Province under Grant 2022-ZZJh-081, and the Interdisciplinary Sciences Project of Nanyang Institute of Technology under Grant NGJC-2022-01.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. The study was mainly conceived and designed by D. G. The experiments were performed by Z. W. and J. F. The first draft of the manuscript was written by D. G., and all authors commented on previous versions of the manuscript. Z. W., Z. Z. and Z. S. edited the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors have no relevant financial or non-financial interests to disclose, and no competing interests relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guo, D., Wu, Z., Feng, J. et al. HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification. Appl Intell 53, 24947–24962 (2023). https://doi.org/10.1007/s10489-023-04725-y