Aggregating Rich Deep Semantic Features for Fine-Grained Place Classification

Wei, Tingyu; Hu, Wenxin; Wu, Xingjiao; Zheng, Yingbin; Ye, Hao; Yang, Jing; He, Liang

doi:10.1007/978-3-030-30508-6_5

Tingyu Wei¹²,
Wenxin Hu¹²,
Xingjiao Wu¹²,
Yingbin Zheng¹³,
Hao Ye¹³,
Jing Yang¹² &
…
Liang He¹²

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11729))

Included in the following conference series:

International Conference on Artificial Neural Networks

2669 Accesses

Abstract

This paper proposes a method that aggregates rich deep semantic features for fine-grained place classification. As is known to all, the category of images depends on the objects and text as well as the various semantic regions, hierarchical structure, and spatial layout. However, most recently designed fine-grained classification systems ignored this, the complex multi-level semantic structure of images associated with fine-grained classes has not yet been well explored. Therefore, in this work, our approach composed of two modules: Content Estimator (CNE) and Context Estimator (CXE). CNE generates deep content features by encoding global visual cues of images. CXE obtains rich context features of images, and it consists of three children Estimator: Text Context Estimator (TCE), Object Context Estimator (OCE), and Scene Context Estimator (SCE). When inputting an image into CXE, TCE encodes text cues to identify word-level semantic information, OCE extracts high-dimensional feature then maps it to object semantic information, SCE gains hierarchical structure and spatial layout information by recognizing scene cues. To aggregate rich deep semantic features, we fuse the information about CNE and CXE for fine-grained classification. To the best of our knowledge, this is the first work to leverage the text information from an arbitrary-oriented scene text detector for extracting context information. Moreover, our method explores the fusion of semantic features and demonstrates scene features to give more complementary information with the other cues. Furthermore, the proposed approach achieves state-of-the-art performance on a fine-grained classification dataset, 84.3% on Con-Text.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Arandjelović, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2911–2918 (2012). https://doi.org/10.1109/cvpr.2012.6248018
Bai, X., Yang, M., Lyu, P., Xu, Y., Luo, J.: Integrating scene text and visual appearance for fine-grained image classification. IEEE Access 6, 66322–66335 (2018). https://doi.org/10.1109/access.2018.2878899
Article Google Scholar
Branson, S., Van Horn, G., Belongie, S., Perona, P.: Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952 (2014)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27 (2011). https://doi.org/10.1145/1961189.1961199
Article Google Scholar
Gavves, E., Fernando, B., Snoek, C.G., Smeulders, A.W., Tuytelaars, T.: Fine-grained categorization by alignments. In: International Conference on Computer Vision (ICCV), pp. 1713–1720 (2013). https://doi.org/10.1109/iccv.2013.215
Gavves, E., Fernando, B., Snoek, C.G., Smeulders, A.W., Tuytelaars, T.: Local alignments for fine-grained categorization. Int. J. Comput. Vis. 111, 191–212 (2015). https://doi.org/10.1007/s11263-014-0741-5
Article Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). https://doi.org/10.1109/cvpr.2016.90
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. arXiv Preprint arXiv:1406.2227 (2014)
Karaoglu, S., van Gemert, J.C., Gevers, T.: Con-text: text detection using background connectivity for fine-grained object classification. In: ACM International Conference on Multimedia (MM), pp. 757–760 (2013). https://doi.org/10.1145/2502081.2502197
Karaoglu, S., Tao, R., van Gemert, J.C., Gevers, T.: Con-text: text detection for fine-grained object classification. IEEE Trans. Image Process. 26, 3965–3980 (2017). https://doi.org/10.1109/tip.2017.2707805
Article MathSciNet MATH Google Scholar
Karaoglu, S., Tao, R., Gevers, T., Smeulders, A.W.: Words matter: scene text for image classification and retrieval. IEEE Trans. Multimedia 19, 1063–1076 (2017). https://doi.org/10.1109/tmm.2016.2638622
Article Google Scholar
Karatzas, D., et al.: ICDAR 2015 competition on robust reading. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160 (2015). https://doi.org/10.1109/icdar.2015.7333942
Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2169–2178 (2006). https://doi.org/10.1109/cvpr.2006.68
Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: TextBoxes: a fast text detector with a single deep neural network. In: AAAI Conference on Artificial Intelligence (AAAI), pp. 4161–4167 (2017)
Google Scholar
Liu, J., Kanazawa, A., Jacobs, D., Belhumeur, P.: Dog breed classification using part localization. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7572, pp. 172–185. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33718-5_13
Chapter Google Scholar
Ma, J., et al.: Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 20, 3111–3122 (2018). https://doi.org/10.1109/tmm.2018.2818020
Article Google Scholar
Movshovitz-Attias, Y., Yu, Q., Stumpe, M.C., Shet, V., Arnoud, S., Yatziv, L.: Ontological supervision for fine grained classification of street view storefronts. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1693–1702 (2015). https://doi.org/10.1109/cvpr.2015.7298778
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), pp. 722–729 (2008). https://doi.org/10.1109/icvgip.2008.47
Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vision 42, 145–175 (2001)
Article Google Scholar
Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 413–420 (2009). https://doi.org/10.1109/cvprw.2009.5206537
Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Article MathSciNet Google Scholar
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv Preprint arXiv:1409.1556 (2014)
Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_4
Chapter Google Scholar
Torresani, L., Szummer, M., Fitzgibbon, A.: Efficient object category recognition using classemes. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 776–789. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15549-9_56
Chapter Google Scholar
Vogel, J., Schiele, B.: Semantic modeling of natural scenes for content-based image retrieval. Int. J. Comput. Vision 72(2), 133–157 (2007). https://doi.org/10.1007/s11263-006-8614-1
Article Google Scholar
Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: large-scale scene recognition from abbey to zoo. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485–3492 (2010). https://doi.org/10.1109/cvpr.2010.5539970
Xiao, T., Xu, Y., Yang, K., Zhang, J., Peng, Y., Zhang, Z.: The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 842–850 (2015). https://doi.org/10.1109/cvpr.2015.7298685
Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 834–849. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_54
Chapter Google Scholar
Zheng, Y., Jiang, Y.-G., Xue, X.: Learning hybrid part filters for scene recognition. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 172–185. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_13
Chapter Google Scholar
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452–1464 (2017). https://doi.org/10.1109/tpami.2017.2723009
Article Google Scholar
Zhou, X., et al.: East: an efficient and accurate scene text detector. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2642–2651 (2017). https://doi.org/10.1109/cvpr.2017.283

Download references

Acknowledgement

This research is funded by the Science and Technology Commission of Shanghai Municipality (No. 18511103105) and by Xiaoi Research. The computation is performed in ECNU Public Platform for Innovation (001).

Author information

Authors and Affiliations

East China Normal University, Shanghai, China
Tingyu Wei, Wenxin Hu, Xingjiao Wu, Jing Yang & Liang He
Videt Tech, Shanghai, China
Yingbin Zheng & Hao Ye

Authors

Tingyu Wei
View author publications
You can also search for this author in PubMed Google Scholar
Wenxin Hu
View author publications
You can also search for this author in PubMed Google Scholar
Xingjiao Wu
View author publications
You can also search for this author in PubMed Google Scholar
Yingbin Zheng
View author publications
You can also search for this author in PubMed Google Scholar
Hao Ye
View author publications
You can also search for this author in PubMed Google Scholar
Jing Yang
View author publications
You can also search for this author in PubMed Google Scholar
Liang He
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wenxin Hu .

Editor information

Editors and Affiliations

Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
Igor V. Tetko
Institute of Computer Science, Czech Academy of Sciences, Praha 8, Czech Republic
Věra Kůrková
Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
Pavel Karpov
Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany
Fabian Theis

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wei, T. et al. (2019). Aggregating Rich Deep Semantic Features for Fine-Grained Place Classification. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Image Processing. ICANN 2019. Lecture Notes in Computer Science(), vol 11729. Springer, Cham. https://doi.org/10.1007/978-3-030-30508-6_5

Download citation

DOI: https://doi.org/10.1007/978-3-030-30508-6_5
Published: 09 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30507-9
Online ISBN: 978-3-030-30508-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics