Abstract
Audio-based interfaces usually suffer when noise or other acoustic sources are present in the environment. For robust audio recognition, a single source must first be isolated. Existing solutions to this problem generally require special microphone configurations, and often assume prior knowledge of the spurious sources. We have developed new algorithms for segmenting streams of audio-visual information into their constituent sources by exploiting the mutual information present between audio and visual tracks. Automatic face recognition and image motion analysis methods are used to generate visual features for a particular user; empirically these features have high mutual information with audio recorded from that user. We show how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams; we also show how the method can help reduce the effect of noise in automatic speech recognition.
Contact: MIT AI Lab, Room NE43-829, 545 Technology Square, Cambridge, MA 02139, USA. Phone: 617 253 8966, Fax: 617 253 5060, Email: trevor@ai.mit.edu
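As a rough sketch of the underlying idea (not the authors' actual estimator, which uses richer statistics), the Python fragment below scores candidate face tracks against a single-microphone recording by measuring the mutual information between the short-time audio energy and the motion energy inside each tracked face region, under a simple joint-Gaussian assumption. The names (audio_envelope, motion_energy, gaussian_mi, best_speaker) and the Gaussian approximation are illustrative assumptions, not part of the paper.

import numpy as np

def audio_envelope(signal, frame_len=400):
    # Short-time RMS energy of a mono audio signal, one value per frame.
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))

def motion_energy(face_crops):
    # Mean absolute frame difference inside a tracked face region;
    # face_crops is a (T, H, W) array of grayscale crops, giving T-1 values.
    return np.abs(np.diff(face_crops.astype(float), axis=0)).mean(axis=(1, 2))

def gaussian_mi(x, y):
    # Mutual information of two scalar time series under a joint-Gaussian model:
    # I(X;Y) = -0.5 * log(1 - rho^2), rho being the sample correlation coefficient.
    n = min(len(x), len(y))
    rho = np.corrcoef(x[:n], y[:n])[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2 + 1e-12)

def best_speaker(audio, face_tracks, frame_len=400):
    # Score each candidate face track against the audio; the track with the
    # highest audio-visual mutual information is taken to be the active source.
    env = audio_envelope(audio, frame_len)
    scores = [gaussian_mi(env, motion_energy(track)) for track in face_tracks]
    return int(np.argmax(scores)), scores

Under the Gaussian model the score -0.5 log(1 - rho^2) grows with the correlation between acoustic energy and facial motion, so the face track that maximises it is taken to be the active source; the paper's nonparametric formulation generalises this same audio-visual mutual-information criterion.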
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Darrell, T., Fisher, J.W., Viola, P. (2000). Audio-visual Segmentation and “The Cocktail Party Effect”. In: Tan, T., Shi, Y., Gao, W. (eds) Advances in Multimodal Interfaces — ICMI 2000. ICMI 2000. Lecture Notes in Computer Science, vol 1948. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40063-X_5
DOI: https://doi.org/10.1007/3-540-40063-X_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41180-2
Online ISBN: 978-3-540-40063-9