Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

  • Conference paper
Image and Video Retrieval (CIVR 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2728)

Abstract

This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify the application of audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated using a test set of twelve clips of alternating speakers from the multiple-speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to the performance of two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100×100 pixel square centred on the active speaker's mouth with no prior face detection; the performance upper bound if perfect face detection were available is 69%. This result is significantly better than that of the two purely video image-based schemes.
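For readers unfamiliar with the synchrony measures being reviewed, one widely used definition (due to Hershey and Movellan) models an audio feature and each pixel's intensity trajectory as jointly Gaussian and scores their synchrony by mutual information, which for two Gaussian scalars reduces to a function of their correlation. The sketch below is a minimal illustration of that idea under assumed feature choices (short-time audio energy, raw pixel intensities, time-aligned per frame); it is not the paper's exact formulation, and the function and variable names are hypothetical.

```python
import numpy as np

def gaussian_synchrony_map(audio, video):
    """Per-pixel audio-visual synchrony under a joint-Gaussian model.

    audio: shape (T,), one audio feature per video frame
           (e.g. short-time energy -- an illustrative choice).
    video: shape (T, H, W), per-frame pixel intensities (or frame
           differences), time-aligned with the audio feature.

    For two jointly Gaussian scalars the mutual information is
    -0.5 * log(1 - rho^2), where rho is their Pearson correlation,
    so the map peaks at pixels whose variation tracks the audio.
    """
    T, H, W = video.shape
    a = audio - audio.mean()
    v = video.reshape(T, -1)
    v = v - v.mean(axis=0)
    # Correlation of the audio trajectory with every pixel trajectory;
    # the small epsilon guards against constant (zero-variance) signals.
    rho = (a @ v) / (T * a.std() * v.std(axis=0) + 1e-12)
    mi = -0.5 * np.log(1.0 - np.clip(rho ** 2, 0.0, 1.0 - 1e-12))
    return mi.reshape(H, W)

# Hypothetical usage: take the peak of the map as the speaker location.
# sync = gaussian_synchrony_map(audio_energy, frame_stack)
# row, col = np.unravel_index(np.argmax(sync), sync.shape)
```

Taking the argmax of such a map is a crude localiser over the whole frame; restricting the peak search to detected face regions is the kind of constraint the abstract's "perfect face detection" upper bound refers to.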

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nock, H.J., Iyengar, G., Neti, C. (2003). Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study. In: Bakker, E.M., Lew, M.S., Huang, T.S., Sebe, N., Zhou, X.S. (eds) Image and Video Retrieval. CIVR 2003. Lecture Notes in Computer Science, vol 2728. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45113-7_48

  • DOI: https://doi.org/10.1007/3-540-45113-7_48

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40634-1

  • Online ISBN: 978-3-540-45113-6

  • eBook Packages: Springer Book Archive
