
Recognition of varying size scene images using semantic analysis of deep activation maps

  • Original Paper
  • Published:
Machine Vision and Applications

Abstract

Understanding the complex semantic structure of scene images requires mapping an image from pixel space to a high-level semantic space. In semantic space, a scene image is represented by the posterior probabilities of the concepts (e.g., ‘car,’ ‘chair,’ ‘window’) present in it; this representation is known as the semantic multinomial (SMN) representation. Generating SMNs requires a concept-annotated dataset for concept modeling, which is infeasible to create manually because of the large size of scene databases. To tackle this issue, we propose a novel approach that builds the concept models from pseudo-concepts. A pseudo-concept acts as a proxy for an actual concept, giving a cue for its presence without revealing its actual identity. We propose to use filter responses from the deeper convolutional layers of convolutional neural networks (CNNs) as pseudo-concepts, since filters in deeper convolutional layers are trained for different semantic concepts. Most prior work considers fixed-size (\(\approx \)227\(\times \)227) images for semantic analysis, which suppresses many concepts present in the images. In this work, we preserve the true concept structure of images by passing them at their original resolution through the convolutional layers of a CNN. We further propose to prune non-prominent pseudo-concepts, group similar ones using kernel clustering, and model the resulting groups using dynamic kernel-based support vector machines. We demonstrate that the resulting SMN representation indeed captures semantic concepts better and yields state-of-the-art classification accuracy on varying size scene image datasets such as MIT67 and SUN397.
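The prune-then-group stage of the pipeline above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: random arrays stand in for the deep activation maps of one variable-size image, the pruning criterion (peak response above the mean peak) and the RBF kernel are illustrative assumptions, and plain kernel k-means stands in for the paper's kernel clustering step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for deep activation maps of one image at its original
# resolution: 64 filters (candidate pseudo-concepts), each with a 14x12
# spatial response map. In the paper these come from a deeper conv layer.
acts = rng.random((64, 14, 12))

# Step 1: prune non-prominent pseudo-concepts -- drop filters whose peak
# response falls below the mean peak (an assumed, illustrative threshold).
peak = acts.reshape(64, -1).max(axis=1)
kept = acts[peak > peak.mean()]

# Step 2: group similar pseudo-concepts with kernel k-means. Each surviving
# filter is described by its flattened response map; an RBF kernel stands
# in for the paper's kernel choice.
X = kept.reshape(len(kept), -1)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sq.mean())

def kernel_kmeans(K, k, iters=20, seed=0):
    """Plain kernel k-means (Dhillon et al.): reassign each point to the
    cluster whose mean it is closest to in the kernel-induced space."""
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, k, n)
    for _ in range(iters):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = labels == c
            if not idx.any():
                continue  # empty cluster: leave its distances at +inf
            m = idx.sum()
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].sum(axis=1) / m
                          + K[np.ix_(idx, idx)].sum() / m ** 2)
        new = dist.argmin(axis=1)
        if (new == labels).all():
            break
        labels = new
    return labels

groups = kernel_kmeans(K, k=5)
print(len(kept), "pseudo-concepts kept in", len(set(groups)), "groups")
```

In the paper, each resulting group would then be modeled with a dynamic kernel-based SVM to produce the per-concept posteriors that make up the SMN vector.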







Author information

Corresponding author: Shikha Gupta.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gupta, S., Dileep, A.D. & Thenkanidiyoor, V. Recognition of varying size scene images using semantic analysis of deep activation maps. Machine Vision and Applications 32, 52 (2021). https://doi.org/10.1007/s00138-021-01168-8
