Abstract
A notable characteristic of human cognition is the ability to form reliable hypotheses under extreme uncertainty. Even when lacking the knowledge needed to make a correct inference, humans can draw on related knowledge to produce an approximate inference that is semantically close to the correct one. In the context of object recognition, this ability amounts to hypothesizing the identity of an object in an image without ever having seen a visual training example of that object. The paradigm of low-shot (i.e., zero-shot and few-shot) classification has traditionally been used to address such situations. However, traditional zero-shot and few-shot approaches train classifiers in settings where a majority of classes have been seen or visually observed and only a minority are unseen; the classifiers for the unseen classes are then learned by expressing them in terms of the classifiers for the seen classes. In this paper, we address the related but different problem of object recognition when only a few object classes are visually observed and the majority are previously unseen. Specifically, we pose the following questions: (a) Is it possible to hypothesize the identity of an object in an image without previously having seen any visual training examples of that object? and (b) Can the visual training examples of a few seen object classes provide reliable priors for hypothesizing the identities of objects that belong to the majority unseen object classes? We propose a model for recognizing objects in an image when visual classifiers are available for only a limited number of object classes. To this end, we leverage word embeddings trained on publicly available text corpora as natural-language priors for hypothesizing the identities of objects that belong to the unseen classes. Experimental results on the Microsoft Common Objects in Context (MS-COCO) dataset show that reliable hypotheses about object identities can be formed by exploiting word embeddings trained on the Wikipedia text corpus, even in the absence of explicit visual classifiers for those object classes. To bolster our hypothesis, we conduct additional experiments on a larger dataset of concepts (themes) that we created from the Conceptual Captions dataset. Even on this extremely challenging dataset, our results, though modest, provide an important proof of concept for the proposed model.
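The abstract describes the core mechanism only at a high level, so the following minimal sketch illustrates one plausible reading of it: the labels of seen-class objects detected in an image act as context, and candidate labels for an unrecognized object are ranked by how similar their word embeddings are to that context. The toy vectors, the mean-cosine scoring rule, and the hypothesize function are illustrative assumptions, not the authors' implementation; in the paper, pretrained embeddings such as word2vec or fastText vectors trained on Wikipedia would play this role.

```python
# A minimal sketch (not the authors' implementation) of the core idea:
# rank candidate labels for an unrecognized object by the similarity of
# their word embeddings to the labels of seen-class objects detected in
# the same image. Real word vectors (e.g., word2vec or fastText trained
# on Wikipedia) would replace the toy 4-d vectors below.
import numpy as np

# Toy embeddings standing in for pretrained 300-d word vectors.
EMBED = {
    "fork":  np.array([0.90, 0.10, 0.00, 0.10]),
    "plate": np.array([0.80, 0.20, 0.10, 0.00]),
    "spoon": np.array([0.85, 0.15, 0.05, 0.10]),
    "car":   np.array([0.00, 0.90, 0.10, 0.20]),
    "zebra": np.array([0.10, 0.10, 0.90, 0.30]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hypothesize(seen_labels, candidate_labels, top_k=3):
    """Score each unseen candidate by its mean cosine similarity to the
    embeddings of the seen-class labels detected in the image."""
    context = [EMBED[w] for w in seen_labels]
    scores = {
        c: float(np.mean([cosine(EMBED[c], ctx) for ctx in context]))
        for c in candidate_labels
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Visual classifiers recognized "fork" and "plate"; the third object is
# unknown. The language prior ranks the candidates for its identity.
print(hypothesize(["fork", "plate"], ["spoon", "car", "zebra"]))
```

Running the sketch ranks "spoon" well above "car" and "zebra", mirroring the intuition that a table-setting context makes cutlery the most plausible hypothesis for the unseen object.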
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Sharma, K., Dandu, H., Kumar, A.C.S., Kumar, V., Bhandarkar, S.M. (2021). Exploiting Word Embeddings for Recognition of Previously Unseen Objects. In: Del Bimbo, A., et al. (eds.) Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol. 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_27
DOI: https://doi.org/10.1007/978-3-030-68780-9_27
Print ISBN: 978-3-030-68779-3
Online ISBN: 978-3-030-68780-9