Abstract
Audio-based interfaces usually suffer when noise or other acoustic sources are present in the environment. For robust audio recognition, a single source must first be isolated. Existing solutions to this problem generally require special microphone configurations, and often assume prior knowledge of the spurious sources. We have developed new algorithms for segmenting streams of audio-visual information into their constituent sources by exploiting the mutual information present between audio and visual tracks. Automatic face recognition and image motion analysis methods are used to generate visual features for a particular user; empirically these features have high mutual information with audio recorded from that user. We show how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams; we also show how the method can help reduce the effect of noise in automatic speech recognition.
Contact: MIT AI Lab, Room NE43-829, 545 Technology Square, Cambridge, MA 02139, USA. Phone: 617 253 8966, Fax: 617 253 5060, Email: trevor@ai.mit.edu
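As a rough sketch of the underlying idea (not the authors' actual estimator, which uses richer statistics), the Python fragment below scores candidate face tracks against a single-microphone recording by measuring the mutual information between the short-time audio energy and the motion energy inside each tracked face region, under a simple joint-Gaussian assumption. The names (audio_envelope, motion_energy, gaussian_mi, best_speaker) and the Gaussian approximation are illustrative assumptions, not part of the paper.

import numpy as np

def audio_envelope(signal, frame_len=400):
    # Short-time RMS energy of a mono audio signal, one value per frame.
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))

def motion_energy(face_crops):
    # Mean absolute frame difference inside a tracked face region;
    # face_crops is a (T, H, W) array of grayscale crops, giving T-1 values.
    return np.abs(np.diff(face_crops.astype(float), axis=0)).mean(axis=(1, 2))

def gaussian_mi(x, y):
    # Mutual information of two scalar time series under a joint-Gaussian model:
    # I(X;Y) = -0.5 * log(1 - rho^2), rho being the sample correlation coefficient.
    n = min(len(x), len(y))
    rho = np.corrcoef(x[:n], y[:n])[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2 + 1e-12)

def best_speaker(audio, face_tracks, frame_len=400):
    # Score each candidate face track against the audio; the track with the
    # highest audio-visual mutual information is taken to be the active source.
    env = audio_envelope(audio, frame_len)
    scores = [gaussian_mi(env, motion_energy(track)) for track in face_tracks]
    return int(np.argmax(scores)), scores

Under the Gaussian model the score -0.5 log(1 - rho^2) grows with the correlation between acoustic energy and facial motion, so the face track that maximises it is taken to be the active source; the paper's nonparametric formulation generalises this same audio-visual mutual-information criterion.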
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Darrell, T., Fisher, J.W., Viola, P. (2000). Audio-visual Segmentation and “The Cocktail Party Effect”. In: Tan, T., Shi, Y., Gao, W. (eds) Advances in Multimodal Interfaces — ICMI 2000. ICMI 2000. Lecture Notes in Computer Science, vol 1948. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40063-X_5
DOI: https://doi.org/10.1007/3-540-40063-X_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41180-2
Online ISBN: 978-3-540-40063-9