Multilevel Integration of Vision and Speech Understanding Using Bayesian Networks

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1542)

Abstract

The interaction of image and speech processing is a crucial property of multimedia systems. Classical systems that draw inferences on purely qualitative high-level descriptions lose much information when confronted with erroneous, vague, or incomplete data. We propose a new architecture that integrates various levels of processing by using multiple representations of the visually observed scene. These representations are vertically connected by Bayesian networks in order to find the most plausible interpretation of the scene.

The interpretation of a spoken utterance naming an object in the visually observed scene is modeled as another partial representation of the scene. Under this concept, the key problem is identifying the verbally specified object instances in the visually observed scene. To this end, a Bayesian network is generated dynamically from the spoken utterance and the visual scene representation. This network encodes spatial knowledge as well as knowledge extracted from psycholinguistic experiments. First results show the robustness of our approach.
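The identification step can be illustrated with a toy calculation. The sketch below is not the network described in the paper: it assumes a much simpler, hypothetical model in which each scene object carries uncertain type and color beliefs from a vision module, the uttered words depend on the true type and color through made-up confusion tables, and the prior over the intended object is uniform. All names (`SceneObject`, `P_TYPE_WORD`, `P_COLOR_WORD`, `identify`) and all probabilities are assumptions chosen for illustration, and spatial relations are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """A detected object with uncertain visual labels (hypothetical structure)."""
    name: str
    type_belief: dict   # P(true type | image evidence)
    color_belief: dict  # P(true color | image evidence)


# Assumed P(uttered word | true property); the values are invented for this sketch.
P_TYPE_WORD = {
    "bolt": {"bolt": 0.9, "bar": 0.1},
    "bar": {"bolt": 0.2, "bar": 0.8},
}
P_COLOR_WORD = {
    "red": {"red": 0.8, "orange": 0.2},
    "orange": {"red": 0.3, "orange": 0.7},
    "blue": {"blue": 1.0},
}


def evidence(belief, word_model, word):
    """Marginalize the hidden true property: sum_v P(v | image) * P(word | v)."""
    return sum(p * word_model.get(value, {}).get(word, 0.0)
               for value, p in belief.items())


def identify(objects, type_word, color_word):
    """Posterior over the intended object, given a uniform prior over objects."""
    scores = {
        obj.name: evidence(obj.type_belief, P_TYPE_WORD, type_word)
        * evidence(obj.color_belief, P_COLOR_WORD, color_word)
        for obj in objects
    }
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("utterance is incompatible with every object in the scene")
    return {name: score / total for name, score in scores.items()}


SCENE = [
    SceneObject("obj1", {"bolt": 0.7, "bar": 0.3}, {"red": 0.9, "orange": 0.1}),
    SceneObject("obj2", {"bar": 0.8, "bolt": 0.2}, {"red": 0.6, "orange": 0.4}),
    SceneObject("obj3", {"bolt": 0.9, "bar": 0.1}, {"blue": 1.0}),
]

if __name__ == "__main__":
    # Utterance: "the red bolt"
    for name, p in identify(SCENE, "bolt", "red").items():
        print(f"{name}: {p:.3f}")
```

With these invented numbers, the utterance "red bolt" ranks `obj1` highest (posterior of about 0.72), keeps `obj2` as a weaker candidate because its visual labels are uncertain, and rules out `obj3` because its color is incompatible with the word. Keeping soft beliefs instead of hard labels is what lets an erroneous or vague recognition result still contribute to the final decision.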

The work of G. Socher has been supported by the German Research Foundation (DFG).

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wachsmuth, S., Brandt-Pook, H., Socher, G., Kummert, F., Sagerer, G. (1999). Multilevel Integration of Vision and Speech Understanding Using Bayesian Networks. In: Computer Vision Systems. ICVS 1999. Lecture Notes in Computer Science, vol 1542. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49256-9_15

  • DOI: https://doi.org/10.1007/3-540-49256-9_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-65459-9

  • Online ISBN: 978-3-540-49256-6

  • eBook Packages: Springer Book Archive
