HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification

Abstract

Remote sensing image scene classification methods based on convolutional neural networks (CNNs) have been extremely successful. However, the inherent limitations of CNNs make it difficult to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies to acquire global information, but it is computationally intensive. In addition, each scene class in remote sensing images contains a large quantity of similar background or foreground features. To effectively leverage these similar features and reduce computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining CNN and Transformer, and consists of the Convolution and Attention Block (CAB) and the Convolution and Token Merging Block (CTMB). Specifically, in the CAB module, the embedding layer of the original Vision Transformer is replaced with a modified MBConv (MBConv*), and Fast Multi-Head Self-Attention (F-MHSA) is used to reduce the quadratic complexity of the self-attention mechanism to linear. To further decrease the model's computational cost, CTMB employs adaptive token merging (ATOME) to fuse related foreground or background features. Experimental results on the UCM, AID and NWPU datasets show that the proposed model achieves better accuracy and efficiency than state-of-the-art remote sensing scene classification methods. On the most challenging NWPU dataset, HELViT achieves the highest accuracy of 94.64%/96.84% with 4.6 GMACs for 10%/20% training samples, respectively.
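As a rough illustration of the linear-attention idea mentioned above, the PyTorch sketch below implements a generic additive-attention layer whose cost grows linearly with the number of tokens. The module name LinearMultiHeadAttention and all layer details are illustrative assumptions under a Fastformer-style formulation; they do not reproduce the exact F-MHSA or MBConv* design used in HELViT.

```python
# Minimal sketch of linear-complexity multi-head attention (Fastformer-style).
# Illustrative only: this is NOT the HELViT F-MHSA implementation.
import torch
import torch.nn as nn


class LinearMultiHeadAttention(nn.Module):
    """Additive attention with O(N) cost in the token count.

    Instead of forming the N x N score matrix of standard self-attention,
    the queries are pooled into a single global query per head, which then
    modulates the keys and values, so the cost grows linearly with N.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, _ = x.shape
        qkv = self.to_qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)

        # Pool queries into one global query per head (simplified scoring).
        alpha = torch.softmax(q.sum(-1) / self.head_dim ** 0.5, dim=-1)   # (b, heads, n)
        global_q = torch.einsum("bhn,bhnd->bhd", alpha, q)                # (b, heads, head_dim)

        # Mix the global query into the keys, then pool again.
        p = k * global_q.unsqueeze(2)                                     # (b, heads, n, head_dim)
        beta = torch.softmax(p.sum(-1) / self.head_dim ** 0.5, dim=-1)    # (b, heads, n)
        global_k = torch.einsum("bhn,bhnd->bhd", beta, p)                 # (b, heads, head_dim)

        # Broadcast the pooled context over the value tokens and project out.
        out = (v * global_k.unsqueeze(2)).permute(0, 2, 1, 3).reshape(b, n, -1)
        return self.proj(out)


if __name__ == "__main__":
    layer = LinearMultiHeadAttention(dim=64, num_heads=8)
    tokens = torch.randn(2, 196, 64)          # e.g. 14x14 patch tokens
    print(layer(tokens).shape)                # torch.Size([2, 196, 64])
```

Standard self-attention would instead materialise an n-by-n score matrix, which is what makes it quadratic in the token count; avoiding that matrix is the source of the efficiency gain described in the abstract.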

Data Availability Statement

The datasets used during the study are available at http://weegee.vision.ucmerced.edu/datasets/landuse.html, https://captain-whu.github.io/AID/ and https://gcheng-nwpu.github.io/.

Funding

This work was supported by the National Natural Science Foundation of China (No. 41971365, 41571401, 62102200), the Science and Technology Research Project of Henan Province (No. 212102210492, 232102211058, 232102110299), the Key Research Projects of Henan Higher Education Institutions (No. 23A520053, 23B520030), the Doctoral Research Start-up Fund Project at Nanyang Institute of Technology, the General Project of Humanities and Social Sciences Research in Henan Province under Grant 2022-ZZJh-081, and the Interdisciplinary Sciences Project of Nanyang Institute of Technology under Grant NGJC-2022-01.

Author information

Contributions

All authors contributed to the study conception and design. The study was mainly conceived and designed by D. G. The experiments were performed by Z. W. and J. F. The first draft of the manuscript was written by D. G., and all authors commented on previous versions of the manuscript. Z. W., Z. Z. and Z. S. edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Dongen Guo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors have no relevant financial or non-financial interests to disclose, and no competing interests relevant to the content of this article.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Guo, D., Wu, Z., Feng, J. et al. HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification. Appl Intell 53, 24947–24962 (2023). https://doi.org/10.1007/s10489-023-04725-y
