
Audio-visual Segmentation and “The Cocktail Party Effect”

  • Conference paper

Advances in Multimodal Interfaces — ICMI 2000 (ICMI 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1948)

Abstract

Audio-based interfaces usually suffer when noise or other acoustic sources are present in the environment. For robust audio recognition, a single source must first be isolated. Existing solutions to this problem generally require special microphone configurations, and often assume prior knowledge of the spurious sources. We have developed new algorithms for segmenting streams of audio-visual information into their constituent sources by exploiting the mutual information present between audio and visual tracks. Automatic face recognition and image motion analysis methods are used to generate visual features for a particular user; empirically these features have high mutual information with audio recorded from that user. We show how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams; we also show how the method can help reduce the effect of noise in automatic speech recognition.
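The core idea of the abstract — that audio from a speaker shares high mutual information with visual features of that speaker's face, which lets a single-microphone system attribute audio to the right source — can be illustrated with a minimal sketch. The sketch below is not the paper's estimator: it assumes jointly Gaussian scalar features, for which mutual information reduces to I(X;Y) = -½ log(1 − ρ²) with ρ the correlation coefficient, and all feature streams (audio energy, lip-motion magnitudes) are synthetic placeholders.

```python
import numpy as np

def gaussian_mutual_information(x, y):
    """MI between two scalar streams under a joint-Gaussian assumption:
    I(X;Y) = -0.5 * log(1 - rho^2), rho = correlation coefficient."""
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

rng = np.random.default_rng(0)
n = 500

# Hypothetical per-frame features: a latent speech signal drives both the
# audio energy and the true speaker's lip motion; a distractor face moves
# independently of the audio.
speech = rng.standard_normal(n)
audio_energy = speech + 0.3 * rng.standard_normal(n)
lip_motion_speaker = speech + 0.3 * rng.standard_normal(n)
lip_motion_distractor = rng.standard_normal(n)

mi_speaker = gaussian_mutual_information(audio_energy, lip_motion_speaker)
mi_distractor = gaussian_mutual_information(audio_energy, lip_motion_distractor)

# The audio track shares far more information with the matching face,
# which is the cue used to assign audio to a visual source.
assert mi_speaker > mi_distractor
```

In this toy setting the matching face yields a mutual-information score an order of magnitude above the distractor's; the paper's method plays the same comparison over learned audio-visual features rather than Gaussian scalars.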

Contact: MIT AI Lab, Room NE43-829, 545 Technology Square, Cambridge, MA 02139, USA. Phone: 617 253 8966, Fax: 617 253 5060, Email: trevor@ai.mit.edu




Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Darrell, T., Fisher, J.W., Viola, P. (2000). Audio-visual Segmentation and “The Cocktail Party Effect”. In: Tan, T., Shi, Y., Gao, W. (eds) Advances in Multimodal Interfaces — ICMI 2000. ICMI 2000. Lecture Notes in Computer Science, vol 1948. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40063-X_5


  • DOI: https://doi.org/10.1007/3-540-40063-X_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41180-2

  • Online ISBN: 978-3-540-40063-9
