Unsupervised Language Learning for Discovered Visual Concepts

  • Conference paper

Computer Vision – ACCV 2012 (ACCV 2012)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7727)

Included in the following conference series: Asian Conference on Computer Vision


Abstract

Computational models of grounded language learning have been based on the premise that words and concepts are learned simultaneously. Given the mounting cognitive evidence for concept formation in infants, we argue that the availability of pre-lexical concepts (learned from image sequences) leads to considerable computational efficiency in word acquisition. Key to the process is a model of bottom-up visual attention in dynamic scenes. We use existing work in background-foreground segmentation, multiple object tracking, object discovery, and trajectory clustering to form object-category and action concepts. The set of acquired concepts under visual attentive focus is then correlated with contemporaneous commentary to learn the grounded semantics of words and multi-word phrasal concatenations from the narrative. We demonstrate that even from a mere five minutes of video, a number of rudimentary visual concepts can be discovered. When these concepts are associated with unedited English commentary, several words emerge: more than 60% of the concepts discovered from the video are associated with correct language labels. Thus, the computational model imitates the beginning of language comprehension, based on attentional parsing of the visual data. Finally, the emergence of multi-word phrasal concatenations, a precursor to syntax, is observed where there are more salient referents than single words.
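The abstract does not spell out the correlation step, so the following is a minimal sketch of one common way to realize this kind of cross-situational learning: score co-occurrences between the concepts under attentive focus and the words of the contemporaneous commentary, then label each concept with its best-associated word. The episode data, the stopword list, and the Dice-style scorer are illustrative assumptions, not the paper's implementation.

    # Minimal sketch of cross-situational word-concept association.
    # Hypothetical data: each "episode" pairs the visual concepts under
    # attentive focus in a time window with the words uttered in that window.
    from collections import Counter
    from itertools import product

    episodes = [
        ({"car", "moving"},     ["the", "red", "car", "drives", "past"]),
        ({"person", "walking"}, ["a", "man", "walks", "across", "the", "road"]),
        ({"car", "stopping"},   ["the", "car", "slows", "down"]),
    ]

    # Crude function-word filter; a real system would rely on frequency
    # and attention statistics instead, so this list is purely illustrative.
    STOPWORDS = {"the", "a", "an", "of"}

    concept_counts, word_counts, joint_counts = Counter(), Counter(), Counter()
    for concepts, words in episodes:
        tokens = {w for w in words if w not in STOPWORDS}  # word types per episode
        concept_counts.update(concepts)
        word_counts.update(tokens)
        joint_counts.update(product(concepts, tokens))     # concept-word pairs

    def dice(concept, word):
        """Dice co-occurrence score: 1.0 iff the pair always occurs together."""
        joint = joint_counts[(concept, word)]
        return 2.0 * joint / (concept_counts[concept] + word_counts[word])

    # Label each discovered concept with its best-associated word.
    for concept in concept_counts:
        best = max(word_counts, key=lambda w: dice(concept, w))
        print(f"{concept:>10} -> {best} (score {dice(concept, best):.2f})")

With only a handful of episodes, some mappings remain ambiguous (a motion concept can latch onto a co-occurring adjective such as "red"); the premise of cross-situational learning is that such ambiguities wash out as episodes accumulate. Bigrams can be scored the same way as single words, which is the flavor of the multi-word phrasal concatenations the abstract reports when a single word does not cover a salient referent.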




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guha, P., Mukerjee, A. (2013). Unsupervised Language Learning for Discovered Visual Concepts. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds) Computer Vision – ACCV 2012. ACCV 2012. Lecture Notes in Computer Science, vol 7727. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37447-0_40


  • DOI: https://doi.org/10.1007/978-3-642-37447-0_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37446-3

  • Online ISBN: 978-3-642-37447-0

  • eBook Packages: Computer Science, Computer Science (R0)
