Visual Semantic-Based Representation Learning Using Deep CNNs for Scene Recognition

Authors:

Krishan Sharma,

Dileep Aroor Dinesh,

Veena Thenkanidiyoor

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 17, Issue 2

Article No.: 53, Pages 1 - 24

https://doi.org/10.1145/3436494

Published: 11 May 2021

Abstract

In this work, we address the task of scene recognition from image data. A scene is a spatially correlated arrangement of visual semantic contents, also known as concepts, e.g., “chair,” “car,” and “sky.” Representation learning from visual semantic content is one of the most natural ideas, as it mimics how humans perceive visual information. The semantic multinomial (SMN) representation is one such representation: it captures semantic information as the posterior probabilities of concepts. The core step in obtaining an SMN representation is building concept models, which requires ground-truth (true) concept labels for every concept present in an image. However, manually labeling concepts is practically infeasible because of the large number of images in a dataset. To address this issue, we propose an approach for generating pseudo-concepts in the absence of true concept labels. We use pre-trained deep CNN architectures, treating the activation maps (filter responses) from convolutional layers as initial cues to the pseudo-concepts. Non-significant activation maps are removed using the proposed filter-specific threshold-based approach, which removes non-prominent concepts from the data. Further, we propose a mechanism that groups identical pseudo-concepts using subspace modeling of filter responses to achieve a non-redundant representation. Experimental studies show that the SMN representation generated from pseudo-concepts achieves comparable results on scene recognition tasks on standard datasets such as MIT-67 and SUN-397, even in the absence of true concept labels.
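
The abstract does not spell out the threshold rule or the concept models, so the following is only a minimal sketch of two ideas it names, under stated assumptions. Each filter gets its own threshold (here, hypothetically, the mean of its peak response over a set of images), filters whose peak response in an image falls below that threshold are discarded for that image, and an SMN-style vector is formed by normalizing per-concept scores into posterior probabilities with a softmax. The function names, the thresholding rule, the use of a softmax, and the toy numbers are all illustrative assumptions, not the authors' method.

```python
import math


def peak(activation_map):
    """Largest response in one 2-D activation map (a list of rows)."""
    return max(max(row) for row in activation_map)


def filter_thresholds(dataset_maps, scale=1.0):
    """Assumed rule: threshold of filter k = scale * mean peak response of k over all images.

    dataset_maps[i][k] is the 2-D activation map of filter k for image i.
    """
    num_filters = len(dataset_maps[0])
    thresholds = []
    for k in range(num_filters):
        peaks = [peak(image_maps[k]) for image_maps in dataset_maps]
        thresholds.append(scale * sum(peaks) / len(peaks))
    return thresholds


def significant_filters(image_maps, thresholds):
    """Indices of filters whose peak response in this image reaches their own threshold."""
    return [k for k, fmap in enumerate(image_maps) if peak(fmap) >= thresholds[k]]


def semantic_multinomial(concept_scores):
    """Softmax over raw per-concept scores: non-negative entries that sum to 1."""
    top = max(concept_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in concept_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (made up): two images, three filters, 2x2 activation maps.
image_a = [[[0.0, 0.1], [0.2, 0.0]], [[5.0, 4.0], [3.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
image_b = [[[0.9, 0.1], [0.0, 0.0]], [[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.2], [0.0, 0.0]]]

thresholds = filter_thresholds([image_a, image_b])
kept_a = significant_filters(image_a, thresholds)  # [1]
kept_b = significant_filters(image_b, thresholds)  # [0, 2]

# Hypothetical scores from three pseudo-concept models for one image.
smn = semantic_multinomial([2.0, 1.0, 0.1])
```

In this sketch the threshold is specific to each filter, so a filter with generally strong responses (filter 1) is kept only where it fires unusually strongly, while a weakly responding filter (filter 2) can still be kept when its response is high relative to its own typical level.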

Cited By

Jiang Y, Dale R (2025) Mapping the learning curves of deep learning networks. PLOS Computational Biology 21:2 (e1012286). Online publication date: 10-Feb-2025
https://doi.org/10.1371/journal.pcbi.1012286
Sun Y, Li P, Sun H, Xu H, Wang R (2025) Feature selection through adaptive sparse learning for scene recognition. Applied Soft Computing 169:C. Online publication date: 1-Jan-2025
https://dl.acm.org/doi/10.1016/j.asoc.2024.112439
Giveki D, Esfandyari S (2025) Semantic image representation for image recognition and retrieval using multilayer variational auto-encoder, InceptionNet and low-level image features. The Journal of Supercomputing 81:1. Online publication date: 1-Jan-2025
https://dl.acm.org/doi/10.1007/s11227-024-06792-5
Index Terms

Visual Semantic-Based Representation Learning Using Deep CNNs for Scene Recognition
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations
      2. Computer vision tasks
        Scene understanding
  2. Machine learning
    1. Machine learning approaches
      1. Kernel methods
        Support vector machines

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 17, Issue 2

May 2021

410 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3461621

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 May 2021

Accepted: 01 November 2020

Revised: 01 August 2020

Received: 01 September 2019

Published in TOMM Volume 17, Issue 2

Qualifiers

Research-article
Research
Refereed

Article Metrics

Total Citations: 14
Total Downloads: 238
Downloads (Last 12 months): 16
Downloads (Last 6 weeks): 0

Reflects downloads up to 03 Mar 2025
