
FCT: fusing CNN and transformer for scene classification

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

Scene classification based on convolutional neural networks (CNNs) has achieved great success in recent years. In CNNs, the convolution operation excels at extracting local features, but its ability to capture global feature representations is limited. In the vision transformer (ViT), the self-attention mechanism captures long-range feature dependencies but tends to lose the fine details of local features. In this work, we exploit the complementary strengths of CNNs and the ViT and propose a Transformer-based framework combined with a CNN to improve the discriminative ability of features for scene classification. Specifically, we take deep convolutional features as input and build a scene Transformer module to extract global features from the scene image. By fusing the CNN with the scene Transformer module, we obtain an end-to-end scene classification framework called FCT. Experimental results show that FCT achieves new state-of-the-art performance on two standard benchmarks, MIT Indoor 67 and SUN 397, with accuracies of 90.75% and 77.50%, respectively.
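To make the fusion idea concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a CNN backbone produces a local feature map, the spatial positions of that map are flattened into a token sequence, and a Transformer encoder applies self-attention (softmax(QK^T / sqrt(d_k)) V, after Vaswani et al.) over those tokens to model global context before classification. The backbone choice, module names, and hyperparameters here are illustrative assumptions, not the authors' exact FCT configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FCTSketch(nn.Module):
    """Illustrative CNN + Transformer fusion for scene classification.

    NOTE: a sketch under assumed hyperparameters, not the authors'
    exact FCT architecture; positional encodings and the paper's
    specific fusion details are omitted for brevity.
    """

    def __init__(self, num_classes=67, dim=2048, depth=2, heads=8):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        # Drop global pooling and the fc head: the output is a
        # (B, 2048, H/32, W/32) convolutional map of local features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        # "Scene Transformer module": self-attention over CNN tokens.
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        feat = self.cnn(x)                        # (B, C, H, W) local features
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        glob = self.transformer(tokens)           # global context via attention
        return self.head(glob.mean(dim=1))        # pool tokens and classify

model = FCTSketch(num_classes=67)                 # e.g. MIT Indoor 67
logits = model(torch.randn(1, 3, 224, 224))       # -> torch.Size([1, 67])
```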


Data availability

All data included in this study are available upon request from the corresponding author.


Acknowledgements

This research was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 61873274.

Author information


Corresponding author

Correspondence to Yanming Guo.

Ethics declarations

Interest Statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xie, Y., Yan, J., Kang, L. et al. FCT: fusing CNN and transformer for scene classification. Int J Multimed Info Retr 11, 611–618 (2022). https://doi.org/10.1007/s13735-022-00252-7

