Abstract
Scene classification based on convolutional neural networks (CNNs) has achieved great success in recent years. In CNNs, the convolution operation extracts local features well, but its ability to capture global feature representations is limited. In the vision transformer (ViT), the self-attention mechanism captures long-range feature dependencies, but it tends to lose the details of local features. In this work, we combine the strengths of the CNN and ViT and propose a Transformer-based framework that incorporates a CNN to improve the discriminative ability of features for scene classification. Specifically, we take the deep convolutional feature as the input and build a scene Transformer module to extract the global feature of the scene image. An end-to-end scene classification framework, called FCT, is built by fusing the CNN and the scene Transformer module. Experimental results show that FCT achieves new state-of-the-art performance on two standard benchmarks, MIT Indoor 67 and SUN 397, with accuracies of 90.75% and 77.50%, respectively.
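The idea of feeding deep convolutional features into a self-attention module and fusing the result with the CNN feature can be illustrated with a toy sketch. This is not the FCT implementation: the abstract does not specify the fusion operator, the number of heads, or the layer layout, so the sketch assumes single-head attention with identity projections (no learned weights), average pooling, and fusion by concatenation. All function names are hypothetical.

```python
import math


def tokens_from_feature_map(fmap):
    """Flatten a C x H x W feature map into H*W tokens of dimension C."""
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][i][j] for c in range(channels)]
            for i in range(height) for j in range(width)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections); a trained model would learn these weights.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    attended = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in tokens]
        weights = softmax(scores)
        attended.append([sum(w * value[d] for w, value in zip(weights, tokens))
                         for d in range(dim)])
    return attended


def mean_pool(tokens):
    """Average the tokens into one vector."""
    count = len(tokens)
    return [sum(t[d] for t in tokens) / count for d in range(len(tokens[0]))]


def fuse(fmap):
    """Concatenate the pooled CNN feature with the pooled attention output.

    Assumption: concatenation stands in for whatever fusion FCT uses.
    """
    tokens = tokens_from_feature_map(fmap)
    local_feature = mean_pool(tokens)                   # CNN branch
    global_feature = mean_pool(self_attention(tokens))  # attention branch
    return local_feature + global_feature


# Toy 2-channel, 2x2 "deep feature map" standing in for a CNN output.
fmap = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0]]]
fused = fuse(fmap)
print(len(fused))  # 4: two CNN channels plus two attention channels
```

In a real model the fused vector would go to a classifier layer, and the attention block would carry learned projections, multiple heads, and positional information.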
Data availability
All data included in this study are available from the corresponding author upon request.
Acknowledgements
This research was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 61873274.
Ethics declarations
Interest Statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xie, Y., Yan, J., Kang, L. et al. FCT: fusing CNN and transformer for scene classification. Int J Multimed Info Retr 11, 611–618 (2022). https://doi.org/10.1007/s13735-022-00252-7
DOI: https://doi.org/10.1007/s13735-022-00252-7