
FCT: fusing CNN and transformer for scene classification

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

Scene classification based on convolutional neural networks (CNNs) has achieved great success in recent years. In CNNs, the convolution operation excels at extracting local features, but its ability to capture global feature representations is limited. In the vision transformer (ViT), the self-attention mechanism captures long-range feature dependencies but tends to lose the fine details of local features. In this work, we exploit the complementary strengths of CNNs and the ViT and propose a Transformer-based framework combined with a CNN to improve the discriminative ability of features for scene classification. Specifically, we take deep convolutional features as input and build a scene Transformer module to extract global features from the scene image. By fusing the CNN with the scene Transformer module, we obtain an end-to-end scene classification framework called FCT. Experimental results show that FCT achieves new state-of-the-art performance on two standard benchmarks, MIT Indoor 67 and SUN 397, with accuracies of 90.75% and 77.50%, respectively.
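To make the fusion idea concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a CNN backbone produces a local feature map, the spatial positions of that map are flattened into a token sequence, and a Transformer encoder applies self-attention (softmax(QK^T / sqrt(d_k)) V, after Vaswani et al.) over those tokens to model global context before classification. The backbone choice, module names, and hyperparameters here are illustrative assumptions, not the authors' exact FCT configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FCTSketch(nn.Module):
    """Illustrative CNN + Transformer fusion for scene classification.

    NOTE: a sketch under assumed hyperparameters, not the authors'
    exact FCT architecture; positional encodings and the paper's
    specific fusion details are omitted for brevity.
    """

    def __init__(self, num_classes=67, dim=2048, depth=2, heads=8):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        # Drop global pooling and the fc head: the output is a
        # (B, 2048, H/32, W/32) convolutional map of local features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        # "Scene Transformer module": self-attention over CNN tokens.
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        feat = self.cnn(x)                        # (B, C, H, W) local features
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        glob = self.transformer(tokens)           # global context via attention
        return self.head(glob.mean(dim=1))        # pool tokens and classify

model = FCTSketch(num_classes=67)                 # e.g. MIT Indoor 67
logits = model(torch.randn(1, 3, 224, 224))       # -> torch.Size([1, 67])
```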


Data availability

All data included in this study are available upon request from the corresponding author.


Acknowledgements

This research was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 61873274.

Author information


Corresponding author

Correspondence to Yanming Guo.

Ethics declarations

Interest Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xie, Y., Yan, J., Kang, L. et al. FCT: fusing CNN and transformer for scene classification. Int J Multimed Info Retr 11, 611–618 (2022). https://doi.org/10.1007/s13735-022-00252-7

