research-article

Visual Semantic-Based Representation Learning Using Deep CNNs for Scene Recognition

Published: 11 May 2021

Abstract

In this work, we address the task of scene recognition from image data. A scene is a spatially correlated arrangement of various visual semantic contents, also known as concepts, e.g., “chair,” “car,” and “sky.” Representation learning using visual semantic content can be regarded as one of the most natural ideas, as it mimics the human behavior of perceiving visual information. Semantic multinomial (SMN) representation is one such representation; it captures semantic information using posterior probabilities of concepts. The core part of obtaining an SMN representation is building concept models, which requires ground-truth (true) concept labels for every concept present in an image. However, manually labeling concepts is not practically feasible given the large number of images in a dataset. To address this issue, we propose an approach for generating pseudo-concepts in the absence of true concept labels. We use pre-trained deep CNN-based architectures, where activation maps (filter responses) from convolutional layers are treated as initial cues to the pseudo-concepts. Non-significant activation maps are removed using the proposed filter-specific threshold-based approach, which removes non-prominent concepts from the data. Further, we propose a grouping mechanism that groups identical pseudo-concepts using subspace modeling of filter responses to achieve a non-redundant representation. Experimental studies show that the SMN representation generated using pseudo-concepts achieves comparable results for scene recognition tasks on standard datasets such as MIT-67 and SUN-397, even in the absence of true concept labels.
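The two ideas the abstract leans on most, filter-specific thresholding of activation maps and an SMN vector of concept posteriors, can be illustrated with a minimal sketch. The abstract does not give the actual threshold rule or the form of the concept models, so everything here is an assumption made for illustration: the threshold of a filter is taken as the mean of its peak response over training images, and the SMN vector is a softmax over hypothetical pseudo-concept scores. The function names and toy data are likewise hypothetical.

```python
import math

def peak(activation_map):
    """Largest response in one H x W activation map (a list of rows)."""
    return max(max(row) for row in activation_map)

def filter_thresholds(train_maps):
    """train_maps[n][k] is the map of filter k on training image n.
    Assumed rule (not from the paper): the threshold of filter k is the
    mean of its peak response over all training images."""
    n_images = len(train_maps)
    n_filters = len(train_maps[0])
    return [sum(peak(train_maps[n][k]) for n in range(n_images)) / n_images
            for k in range(n_filters)]

def significant_filters(image_maps, thresholds):
    """Indices of filters whose peak response on this image exceeds
    that filter's own threshold; the other maps are discarded."""
    return [k for k, m in enumerate(image_maps) if peak(m) > thresholds[k]]

def semantic_multinomial(scores):
    """Softmax over pseudo-concept scores, giving a vector of
    posterior-like probabilities that sums to one (assumed form)."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: 2 training images, 3 filters, 2 x 2 activation maps.
train_maps = [
    [[[0.9, 0.1], [0.2, 0.0]], [[0.1, 0.0], [0.0, 0.1]], [[0.5, 0.4], [0.3, 0.2]]],
    [[[0.7, 0.3], [0.1, 0.0]], [[0.3, 0.1], [0.0, 0.2]], [[0.1, 0.1], [0.0, 0.1]]],
]
thresholds = filter_thresholds(train_maps)
kept = significant_filters(train_maps[0], thresholds)
smn = semantic_multinomial([2.0, 0.5, -1.0])
```

Using one threshold per filter, instead of a single global cutoff, means a filter that responds weakly everywhere is judged against its own typical response. The grouping of identical pseudo-concepts by subspace modeling is not sketched here, since the abstract gives no detail on how the subspaces are built.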

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 2
May 2021
410 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3461621
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 May 2021
Accepted: 01 November 2020
Revised: 01 August 2020
Received: 01 September 2019
Published in TOMM Volume 17, Issue 2

Author Tags

  1. pseudo-concept
  2. pseudo-concept modeling
  3. semantic multinomial representation
  4. subspace modeling
  5. support vector machine

Qualifiers

  • Research-article
  • Research
  • Refereed
