Abstract
Computational models of grounded language learning have been based on the premise that words and concepts are learned simultaneously. Given the mounting cognitive evidence for concept formation in infants, we argue that the availability of pre-lexical concepts (learned from image sequences) leads to considerable computational efficiency in word acquisition. Key to the process is a model of bottom-up visual attention in dynamic scenes. We use existing work in background-foreground segmentation, multiple object tracking, object discovery and trajectory clustering to form object category and action concepts. The set of acquired concepts under visual attentive focus is then correlated with contemporaneous commentary to learn the grounded semantics of words and multi-word phrasal concatenations from the narrative. We demonstrate that even from a mere 5 minutes of video segments, a number of rudimentary visual concepts can be discovered. When these concepts are associated with unedited English commentary, we observe that several words emerge: more than 60% of the concepts discovered from the video are associated with correct language labels. Thus, the computational model imitates the beginning of language comprehension, based on attentional parsing of the visual data. Finally, the emergence of multi-word phrasal concatenations, a precursor to syntax, is observed where there are more salient referents than single words.
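The association step described above, correlating attended concepts with contemporaneous commentary, can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not state which association measure is used, so the example assumes a count-weighted pointwise mutual information score. The function name `associate_words`, the concept identifiers `C1` and `C2`, and the toy commentary are all hypothetical.

```python
import math
from collections import Counter


def associate_words(episodes):
    """Pick one candidate word label per discovered concept.

    `episodes` is a list of (concept_id, words) pairs: the concept under
    attentional focus and the words of the commentary uttered at that time.
    Scoring by count-weighted PMI is an assumption made for illustration,
    not the measure used in the paper.
    """
    concept_counts = Counter()
    word_counts = Counter()
    joint_counts = Counter()

    for concept, words in episodes:
        concept_counts[concept] += 1
        # Count each word at most once per episode.
        for word in set(words):
            word_counts[word] += 1
            joint_counts[(concept, word)] += 1

    n = len(episodes)
    best = {}
    for (concept, word), joint in joint_counts.items():
        # PMI = log( P(concept, word) / (P(concept) * P(word)) )
        pmi = math.log((joint * n) / (concept_counts[concept] * word_counts[word]))
        # Weight by the joint count so one-off coincidences score lower.
        score = joint * pmi
        if concept not in best or score > best[concept][1]:
            best[concept] = (word, score)

    return {concept: word for concept, (word, _) in best.items()}


# Toy data: two hypothetical concepts with short commentary fragments.
episodes = [
    ("C1", ["the", "car", "moves", "left"]),
    ("C1", ["a", "car", "goes", "by"]),
    ("C2", ["the", "man", "walks"]),
    ("C2", ["a", "man", "crosses", "the", "road"]),
]
labels = associate_words(episodes)
```

Frequent function words such as "the" co-occur with every concept and therefore receive a low score, which is the intuition behind using a mutual-information style measure rather than raw co-occurrence counts.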
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guha, P., Mukerjee, A. (2013). Unsupervised Language Learning for Discovered Visual Concepts. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds) Computer Vision – ACCV 2012. ACCV 2012. Lecture Notes in Computer Science, vol 7727. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37447-0_40
DOI: https://doi.org/10.1007/978-3-642-37447-0_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37446-3
Online ISBN: 978-3-642-37447-0
eBook Packages: Computer Science (R0)