Abstract
Remote sensing image scene classification methods based on convolutional neural networks (CNNs) have been extremely successful. However, the inherent limitations of CNNs make it difficult for them to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies to acquire global information, but it is computationally intensive. In addition, each scene class in remote sensing images contains a large number of similar background or foreground features. To effectively leverage these similar features and reduce computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining CNN and Transformer, and consists of a Convolution and Attention Block (CAB) and a Convolution and Token Merging Block (CTMB). Specifically, in the CAB module, the embedding layer of the original Vision Transformer is replaced with a modified MBConv (MBConv\(^*\)), and Fast Multi-Head Self-Attention (F-MHSA) reduces the quadratic complexity of the self-attention mechanism to linear. To further decrease the model's computational cost, CTMB employs adaptive token merging (ATOME) to fuse related foreground or background features. Experimental results on the UCM, AID and NWPU datasets show that the proposed model outperforms state-of-the-art remote sensing scene classification methods in both accuracy and efficiency. On the most challenging NWPU dataset, HELViT achieves the highest accuracies of 94.64% and 96.84% with 4.6 GMACs for 10% and 20% training samples, respectively.
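The key efficiency idea above, replacing quadratic self-attention with a linear-complexity variant, can be illustrated with a Fastformer-style additive attention sketch. This is not HELViT's exact F-MHSA (its formulation is not given in this excerpt); the function and parameter names below are assumptions chosen to show how per-token softmax scoring plus global query/key vectors yields O(N·d) cost instead of O(N²·d):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    t = sum(e)
    return [x / t for x in e]

def additive_attention(q, k, v, wa, wb):
    """Additive (Fastformer-style) attention, linear in sequence length.

    Illustrative sketch only: q, k, v are lists of N token vectors
    (each of length d), already projected; wa and wb are length-d
    scoring vectors. All names here are hypothetical.
    """
    n, d = len(q), len(q[0])
    scale = math.sqrt(d)
    # 1. Summarize all queries into one global query vector: O(N*d).
    alpha = softmax([sum(wa[j] * q[i][j] for j in range(d)) / scale
                     for i in range(n)])
    q_glob = [sum(alpha[i] * q[i][j] for i in range(n)) for j in range(d)]
    # 2. Mix the global query into each key elementwise, then summarize
    #    the mixed keys into one global key vector: again O(N*d).
    p = [[q_glob[j] * k[i][j] for j in range(d)] for i in range(n)]
    beta = softmax([sum(wb[j] * p[i][j] for j in range(d)) / scale
                    for i in range(n)])
    k_glob = [sum(beta[i] * p[i][j] for i in range(n)) for j in range(d)]
    # 3. Modulate each value by the global key; add a query residual.
    return [[k_glob[j] * v[i][j] + q[i][j] for j in range(d)]
            for i in range(n)]
```

Because every step reduces over tokens with a single weighted sum rather than forming an N×N attention matrix, total work grows linearly with the number of tokens, which is the property the abstract attributes to F-MHSA.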
Data Availability Statement
The datasets used during the study are available at http://weegee.vision.ucmerced.edu/datasets/landuse.html, https://captain-whu.github.io/AID/ and https://gcheng-nwpu.github.io/.
Funding
This work was supported by the National Natural Science Foundation of China (No. 41971365, 41571401, 62102200), the Science and Technology Research Project of Henan Province (212102210492, 232102211058, 232102110299), the Key Research Projects of Henan Higher Education Institutions (No. 23A520053, 23B520030), the Doctoral Research Start-up Fund Project at Nanyang Institute of Technology, the General Project of Humanities and Social Sciences Research in Henan Province under Grant 2022-ZZJh-081, and the Interdisciplinary Sciences Project of Nanyang Institute of Technology under Grant NGJC-2022-01.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. The study was mainly conceived and designed by D. G. The experiments were performed by Z. W. and J. F. The first draft of the manuscript was written by D. G., and all authors commented on previous versions of the manuscript. Z. W., Z. Z. and Z. S. edited the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors have no relevant financial or non-financial interests to disclose, and no competing interests relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guo, D., Wu, Z., Feng, J. et al. HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification. Appl Intell 53, 24947–24962 (2023). https://doi.org/10.1007/s10489-023-04725-y