Abstract
In recent years, image scene classification based on low-level and high-level features has been considered one of the most important and challenging problems in image processing research. High-level features based on semantic concepts model the content of an image scene more accurately and more closely to human perception. This paper presents a new multi-stage approach for image scene classification based on high-level semantic features extracted from image content. In the first stage, the object boundaries and the labels that represent the content are extracted. For this purpose, a fully convolutional deep network is used together with a combined network of a two-class fuzzy SVM and SVR. Topic modeling is used to represent the latent relationships between the objects. Hence, in the second stage, a new combination of the bag of visual words and the supervised document neural autoregressive distribution estimator is used to extract the latent topics in the image. Finally, a Bayesian classifier assigns the scene class from the features extracted by the deep network, the object labels, and the latent topics in the image. The proposed method has been evaluated on three datasets: Scene15, UIUC Sports, and MIT-67 Indoor. The experimental results show that, compared to the previous state-of-the-art methods, the proposed approach achieves average improvements of 12%, 11% and 14% in object detection accuracy and of 0.5%, 0.6% and 1.8% in the mean average precision of image scene classification on these three datasets, respectively.
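The final stage described above, Bayesian classification from object labels, can be illustrated with a small sketch. The abstract does not specify the form of the Bayesian classifier, so this example assumes a multinomial naive Bayes model with Laplace smoothing over detected object labels only; the function names, the toy training data, and the smoothing parameter `alpha` are all hypothetical, and the deep-network features and latent topics used in the paper are left out.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples, alpha=1.0):
    """Fit a multinomial naive Bayes model (an assumed stand-in classifier).

    samples: list of (object_labels, scene_class) pairs, where object_labels
    is the list of labels a hypothetical detector produced for one image.
    """
    class_counts = Counter()             # number of images per scene class
    label_counts = defaultdict(Counter)  # label frequencies per scene class
    vocabulary = set()                   # all object labels seen in training
    for labels, scene in samples:
        class_counts[scene] += 1
        label_counts[scene].update(labels)
        vocabulary.update(labels)
    return {
        "class_counts": class_counts,
        "label_counts": label_counts,
        "vocabulary": vocabulary,
        "alpha": alpha,
    }


def predict(model, labels):
    """Return the scene class with the highest posterior log-probability."""
    total_images = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    alpha = model["alpha"]
    best_scene, best_score = None, -math.inf
    for scene, n_images in model["class_counts"].items():
        counts = model["label_counts"][scene]
        total_labels = sum(counts.values())
        score = math.log(n_images / total_images)  # log prior
        for label in labels:
            if label not in model["vocabulary"]:
                continue  # ignore labels never seen in training
            # Laplace-smoothed log likelihood of this label given the scene
            score += math.log(
                (counts[label] + alpha) / (total_labels + alpha * vocab_size)
            )
        if score > best_score:
            best_scene, best_score = scene, score
    return best_scene


# Toy training data, made up for illustration only.
TRAIN = [
    (["bed", "lamp", "pillow"], "bedroom"),
    (["bed", "window", "lamp"], "bedroom"),
    (["stove", "sink", "fridge"], "kitchen"),
    (["sink", "table", "fridge"], "kitchen"),
]

model = train_naive_bayes(TRAIN)
print(predict(model, ["bed", "lamp"]))
print(predict(model, ["sink", "stove"]))
```

In a fuller pipeline, the label list passed to `predict` would come from the object detection stage, and the latent topic assignments could be appended as additional discrete features.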
Cite this article
Sorkhi, A.G., Hassanpour, H. & Fateh, M. A comprehensive system for image scene classification. Multimed Tools Appl 79, 18033–18058 (2020). https://doi.org/10.1007/s11042-019-08264-y
DOI: https://doi.org/10.1007/s11042-019-08264-y