Abstract
This paper proposes a method that aggregates rich deep semantic features for fine-grained place classification. The category of an image depends not only on the objects and text it contains but also on its semantic regions, hierarchical structure, and spatial layout. However, most recent fine-grained classification systems ignore this, and the complex multi-level semantic structure of images associated with fine-grained classes has not yet been well explored. Our approach is therefore composed of two modules: a Content Estimator (CNE) and a Context Estimator (CXE). CNE generates deep content features by encoding the global visual cues of an image. CXE obtains rich context features and consists of three child estimators: a Text Context Estimator (TCE), an Object Context Estimator (OCE), and a Scene Context Estimator (SCE). Given an input image, TCE encodes text cues to identify word-level semantic information, OCE extracts high-dimensional features and maps them to object semantic information, and SCE captures hierarchical structure and spatial layout information by recognizing scene cues. To aggregate rich deep semantic features, we fuse the outputs of CNE and CXE for fine-grained classification. To the best of our knowledge, this is the first work to leverage text information from an arbitrary-oriented scene text detector for extracting context information. Moreover, our method explores the fusion of semantic features and demonstrates that scene features provide information complementary to the other cues. The proposed approach achieves state-of-the-art performance of 84.3% on Con-Text, a fine-grained classification dataset.
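The abstract does not specify how the CNE and CXE outputs are fused, so the following is only a minimal sketch of one common option: L2-normalize each feature vector, concatenate them, and score the result with a linear classifier. The function names, the tiny feature dimensions, the three-class setup, and the random weights are all hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(content, text_ctx, object_ctx, scene_ctx):
    """Concatenate the normalized content, text, object, and scene features.

    Assumption: fusion by concatenation. The paper may use a different scheme.
    """
    fused = []
    for part in (content, text_ctx, object_ctx, scene_ctx):
        fused.extend(l2_normalize(part))
    return fused


def classify(fused, weights, biases):
    """Return the index of the highest-scoring class under a linear classifier."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)


def random_vector(dim):
    """Stand-in for a real estimator output (hypothetical placeholder data)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Hypothetical feature sizes for CNE, TCE, OCE, and SCE outputs.
content = random_vector(8)
text_ctx = random_vector(4)
object_ctx = random_vector(6)
scene_ctx = random_vector(5)

fused = fuse_features(content, text_ctx, object_ctx, scene_ctx)

# Untrained random weights; a real system would learn these from data.
num_classes = 3
weights = [random_vector(len(fused)) for _ in range(num_classes)]
biases = [0.0] * num_classes

predicted = classify(fused, weights, biases)
print(len(fused), predicted)
```

Normalizing each cue before concatenation keeps any single estimator from dominating the fused vector simply because its features have a larger scale, which is one plausible reason to prefer it over raw concatenation.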
Acknowledgement
This research is funded by the Science and Technology Commission of Shanghai Municipality (No. 18511103105) and by Xiaoi Research. The computation was performed on the ECNU Public Platform for Innovation (001).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wei, T. et al. (2019). Aggregating Rich Deep Semantic Features for Fine-Grained Place Classification. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Image Processing. ICANN 2019. Lecture Notes in Computer Science, vol 11729. Springer, Cham. https://doi.org/10.1007/978-3-030-30508-6_5
DOI: https://doi.org/10.1007/978-3-030-30508-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30507-9
Online ISBN: 978-3-030-30508-6
eBook Packages: Computer Science, Computer Science (R0)