Unsupervised Language Learning for Discovered Visual Concepts

Guha, Prithwijit; Mukerjee, Amitabha

doi:10.1007/978-3-642-37447-0_40

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 7727)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

Computational models of grounded language learning have been based on the premise that words and concepts are learned simultaneously. Given the mounting cognitive evidence for concept formation in infants, we argue that the availability of pre-lexical concepts (learned from image sequences) leads to considerable computational efficiency in word acquisition. Key to the process is a model of bottom-up visual attention in dynamic scenes. We use existing work in background-foreground segmentation, multiple object tracking, object discovery and trajectory clustering to form object category and action concepts. The set of acquired concepts under visual attentive focus is then correlated with contemporaneous commentary to learn the grounded semantics of words and multi-word phrasal concatenations from the narrative. We demonstrate that a number of rudimentary visual concepts can be discovered from a mere 5 minutes of video segments. When these concepts are associated with unedited English commentary, several words emerge: more than 60% of the concepts discovered from the video are associated with correct language labels. Thus, the computational model imitates the beginning of language comprehension, based on attentional parsing of the visual data. Finally, the emergence of multi-word phrasal concatenations, a precursor to syntax, is observed where there are more salient referents than single words.
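
The abstract does not say which association measure links attended concepts to commentary words, so the following is only a minimal sketch of the general idea, assuming a count-weighted pointwise mutual information score over co-occurrences. The function name `associate_words`, the concept labels (`CAR`, `PERSON`) and the commentary snippets are hypothetical toy data, not taken from the paper.

```python
from collections import Counter
from math import log

# Hypothetical toy episodes: (concepts under attentional focus, commentary words).
EPISODES = [
    ({"CAR"}, ["the", "car", "turns", "left"]),
    ({"CAR"}, ["a", "car", "goes", "by"]),
    ({"PERSON"}, ["the", "man", "walks"]),
    ({"PERSON"}, ["a", "man", "crosses", "the", "road"]),
    ({"CAR", "PERSON"}, ["the", "man", "waits", "for", "the", "car"]),
]

def associate_words(episodes):
    """Return {concept: word} using a count-weighted PMI score.

    Assumption: each episode pairs the attended concepts with the words
    spoken at the same time; the paper's actual measure may differ.
    """
    n = len(episodes)
    concept_count = Counter()
    word_count = Counter()
    joint_count = Counter()
    for concepts, words in episodes:
        concepts, words = set(concepts), set(words)  # count once per episode
        concept_count.update(concepts)
        word_count.update(words)
        joint_count.update((c, w) for c in concepts for w in words)

    best = {}
    for (c, w), joint in joint_count.items():
        # PMI = log( P(c, w) / (P(c) * P(w)) ), estimated from episode counts.
        pmi = log(joint * n / (concept_count[c] * word_count[w]))
        score = joint * pmi  # weighting by count dampens one-off words
        if c not in best or score > best[c][1]:
            best[c] = (w, score)
    return {c: w for c, (w, _) in best.items()}

labels = associate_words(EPISODES)
print(labels)
```

On this toy data the frequent function word "the" scores lower than "car" and "man" because it co-occurs with both concepts, which illustrates how restricting the candidates to concepts under attentional focus can keep the association problem small.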

Author information

Authors and Affiliations

Department of Electronics & Electrical Engineering, IIT Guwahati, India
Prithwijit Guha
Department of Computer Science & Engineering, IIT Kanpur, India
Amitabha Mukerjee

Editor information

Editors and Affiliations

Department of Electrical and Computer Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, 151-744, Seoul, Korea
Kyoung Mu Lee
Microsoft Research Asia, No. 5, Danling st., Haidian district, 100080, Beijing, P.R. China
Yasuyuki Matsushita
School of Interactive Computing, Georgia Institute of Technology, 801 Atlantic Drive, CCB 315, 30332, Atlanta, GA, USA
James M. Rehg
Institute of Automation, National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Zhong Quan Cun East Road 95, Haidian District, 100190, Beijing, P.R. China
Zhanyi Hu

About this paper

Cite this paper

Guha, P., Mukerjee, A. (2013). Unsupervised Language Learning for Discovered Visual Concepts. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds) Computer Vision – ACCV 2012. ACCV 2012. Lecture Notes in Computer Science, vol 7727. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37447-0_40

DOI: https://doi.org/10.1007/978-3-642-37447-0_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37446-3
Online ISBN: 978-3-642-37447-0
eBook Packages: Computer Science, Computer Science (R0)