Multilevel Integration of Vision and Speech Understanding Using Bayesian Networks

Wachsmuth, Sven; Brandt-Pook, Hans; Socher, Gudrun; Kummert, Franz; Sagerer, Gerhard

doi:10.1007/3-540-49256-9_15

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1542)

Abstract

The interaction of image and speech processing is a crucial property of multimedia systems. Classical systems that draw inferences on purely qualitative, high-level descriptions lose much information when faced with erroneous, vague, or incomplete data. We propose a new architecture that integrates various levels of processing by using multiple representations of the visually observed scene. These representations are vertically connected by Bayesian networks in order to find the most plausible interpretation of the scene.

The interpretation of a spoken utterance naming an object in the visually observed scene is modeled as another partial representation of the scene. Under this view, the key problem is identifying the verbally specified object instances in the visually observed scene. To this end, a Bayesian network is generated dynamically from the spoken utterance and the visual scene representation. This network encodes spatial knowledge as well as knowledge extracted from psycholinguistic experiments. First results show the robustness of our approach.
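As a rough illustration of the identification problem described above, the following Python sketch scores each visually detected object against a spoken description and normalizes the scores into a posterior. It is not the network from the paper: the two-attribute structure (type and color), the uniform prior, the `p_correct` naming reliability, and all names such as `SceneObject` and `identify` are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SceneObject:
    """A detected object with uncertain visual attributes (hypothetical model)."""
    name: str
    type_belief: Dict[str, float]   # P(true type | image evidence)
    color_belief: Dict[str, float]  # P(true color | image evidence)


def naming_likelihood(word: str, belief: Dict[str, float],
                      p_correct: float = 0.9) -> float:
    """P(word | image evidence), marginalizing over the true attribute value.

    Assumption: a speaker names the true value with probability p_correct
    and otherwise picks one of the remaining values uniformly.
    """
    others = max(len(belief) - 1, 1)
    total = 0.0
    for value, p_value in belief.items():
        p_word = p_correct if value == word else (1.0 - p_correct) / others
        total += p_value * p_word
    return total


def identify(objects: List[SceneObject], type_word: str,
             color_word: str) -> Dict[str, float]:
    """Posterior over which object the utterance refers to (uniform prior)."""
    prior = 1.0 / len(objects)
    scores = {
        obj.name: prior
        * naming_likelihood(type_word, obj.type_belief)
        * naming_likelihood(color_word, obj.color_belief)
        for obj in objects
    }
    normalizer = sum(scores.values())
    return {name: score / normalizer for name, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical scene: vision is unsure about both type and color.
    scene = [
        SceneObject("obj1", {"bar": 0.8, "cube": 0.2}, {"red": 0.7, "orange": 0.3}),
        SceneObject("obj2", {"cube": 0.9, "bar": 0.1}, {"red": 0.6, "orange": 0.4}),
    ]
    print(identify(scene, type_word="cube", color_word="red"))
```

Because both the visual beliefs and the naming model are probabilistic, a misrecognized color or a vague word lowers an object's score rather than excluding it outright, which is the kind of robustness to erroneous or incomplete data that the abstract argues for. Spatial relations and psycholinguistic knowledge, which the paper's network also encodes, are left out of this sketch.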

The work of G. Socher has been supported by the German Research Foundation (DFG).

Author information

Gudrun Socher
Present address: Vidam Communications Inc., 2 N 1st St., San Jose, CA, 95113

Authors and Affiliations

Technical Faculty, University of Bielefeld, P.O. Box 100131, 33501, Bielefeld, Germany
Sven Wachsmuth, Hans Brandt-Pook, Gudrun Socher, Franz Kummert & Gerhard Sagerer

About this paper

Cite this paper

Wachsmuth, S., Brandt-Pook, H., Socher, G., Kummert, F., Sagerer, G. (1999). Multilevel Integration of Vision and Speech Understanding Using Bayesian Networks. In: Computer Vision Systems. ICVS 1999. Lecture Notes in Computer Science, vol 1542. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49256-9_15

DOI: https://doi.org/10.1007/3-540-49256-9_15
Published: 24 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65459-9
Online ISBN: 978-3-540-49256-6
eBook Packages: Springer Book Archive
