A comprehensive system for image scene classification

Multimedia Tools and Applications

Abstract

In recent years, image scene classification based on low- and high-level features has been considered one of the most important and challenging problems in image processing research. High-level features based on semantic concepts provide a more accurate model that is closer to human perception of image scene content. This paper presents a new multi-stage approach for image scene classification based on high-level semantic features extracted from image content. In the first stage, the object boundaries and the labels that represent the content are extracted. For this purpose, a fully convolutional deep network is combined with a network consisting of a two-class SVM-fuzzy classifier and an SVR. Topic modeling is used to represent the latent relationships between the objects. Hence, in the second stage, a new combination of the bag of visual words and the supervised document neural autoregressive distribution estimator is used to extract the latent topics in the image. Finally, classification based on a Bayesian method is performed using the features extracted by the deep network, the object labels, and the latent topics in the image. The proposed method has been evaluated on three datasets: Scene15, UIUC Sports, and MIT-67 Indoor. The experimental results show that, compared to previous state-of-the-art methods, the proposed approach achieves average improvements of 12%, 11%, and 14% in object detection accuracy and of 0.5%, 0.6%, and 1.8% in the mean average precision of image scene classification on these three datasets.
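As a rough illustration of the bag-of-visual-words step mentioned in the abstract, the sketch below quantizes local descriptors against a codebook and builds the word-count "document" that a topic model could consume. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 2-D toy descriptors, the three-word codebook, and the Euclidean nearest-codeword assignment are all hypothetical choices made for illustration, since the abstract does not specify the descriptors, the codebook size, or the assignment rule.

```python
from math import dist


def nearest_word(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))


def bovw_counts(descriptors, codebook):
    """Count how many local descriptors are assigned to each visual word."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    return counts


def normalize(counts):
    """Turn raw counts into a histogram that sums to 1 (all zeros if empty)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical toy data: a 3-word codebook and 4 local descriptors in 2-D.
# In practice the codebook would typically come from clustering (e.g. k-means).
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (4.8, 5.1)]

counts = bovw_counts(descriptors, codebook)
print(counts)             # word counts per codeword
print(normalize(counts))  # normalized histogram
```

The raw counts play the role of word frequencies in a text document, which is why topic models designed for text can be applied to images once the descriptors are quantized.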

Author information

Corresponding author

Correspondence to Ali Ghanbari Sorkhi.

About this article

Cite this article

Sorkhi, A.G., Hassanpour, H. & Fateh, M. A comprehensive system for image scene classification. Multimed Tools Appl 79, 18033–18058 (2020). https://doi.org/10.1007/s11042-019-08264-y

  • DOI: https://doi.org/10.1007/s11042-019-08264-y
