Abstract
Automatic understanding of multi-modal input is a central topic in modern human-computer interfaces. However, the basic question of how the interpretations provided by different modalities can be connected in a universal and robust manner remains an open problem. The most intuitive input modalities, speech perception and vision, can only be correlated on a qualitative, content-based interpretation level, and due to vague meanings and erroneous processing results this is extremely difficult to accomplish. A simple frame-based integration scheme that fills appropriate slots with new analysis results will fail when ambiguous or contradictory information appears. In this paper we propose a new probabilistic framework to overcome these drawbacks. The integration model is built from data collected in labeled test sets and psycholinguistic experiments. Thereby, the correspondence problem is solved in a very robust and universal manner. In particular, we show that erroneous visual interpretations can be corrected by a joint analysis of visual and speech input data.
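The correction effect described above can be illustrated with a minimal sketch of probabilistic evidence fusion. This is not the paper's actual model (which is built from labeled test sets and psycholinguistic data); it is a hypothetical toy example assuming conditionally independent vision and speech observations over a small set of object hypotheses, with all class names and probabilities invented for illustration.

```python
# Toy Bayesian fusion of vision and speech evidence over object hypotheses.
# Assumes the two observation channels are conditionally independent given
# the true object class; all numbers below are illustrative only.

def fuse(prior, vision_likelihood, speech_likelihood):
    """Return the posterior over object classes given both modalities."""
    unnorm = {c: prior[c] * vision_likelihood[c] * speech_likelihood[c]
              for c in prior}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}

prior = {"cube": 1 / 3, "bolt": 1 / 3, "bar": 1 / 3}

# The visual module mistakes the object for a bolt ...
vision = {"cube": 0.2, "bolt": 0.7, "bar": 0.1}

# ... but the utterance ("the cube") strongly supports 'cube'.
speech = {"cube": 0.8, "bolt": 0.1, "bar": 0.1}

posterior = fuse(prior, vision, speech)
best = max(posterior, key=posterior.get)  # joint analysis corrects vision
```

Here the vision-only maximum would be `bolt`, but the fused posterior favors `cube`: the speech evidence outweighs the erroneous visual interpretation, which is the qualitative behavior the abstract claims for the joint analysis.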
© 2000 Springer-Verlag Berlin Heidelberg
Wachsmuth, S., Fink, G.A., Kümmert, F., Sagerer, G. (2000). Using Speech in Visual Object Recognition. In: Sommer, G., Krüger, N., Perwass, C. (eds) Mustererkennung 2000. Informatik aktuell. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-59802-9_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67886-1
Online ISBN: 978-3-642-59802-9
eBook Packages: Springer Book Archive