Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

Nock, Harriet J.; Iyengar, Giridharan; Neti, Chalapathy

doi:10.1007/3-540-45113-7_48


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2728)

Included in the following conference series:

International Conference on Image and Video Retrieval


Abstract

This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify applying audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated on a test set of twelve clips of alternating speakers from the multiple-speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to the performance of two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100×100 pixel square centered on the active speaker’s mouth with no prior face detection; the performance upper bound if perfect face detection were available is 69%. This result is significantly better than that of two purely video image-based schemes.
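
The abstract does not say which synchrony definitions the paper compares, so the following is only an illustrative sketch of one measure of this general kind: mutual information between a per-frame audio feature and each pixel's intensity, under the assumption that the two are jointly Gaussian, in which case I = -0.5 ln(1 - rho^2) for sample correlation rho. The pixel with the highest score is taken as the estimated sound source, and a hit is counted when it falls inside a square centred on a reference point, mirroring the 100×100 pixel criterion described above. All function names, the synthetic clip, and the choice of features are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def gaussian_mi(x, y):
    """Mutual information (nats) between two equal-length scalar sequences,
    assuming they are jointly Gaussian: I = -0.5 * ln(1 - rho^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant signal shares no information with anything
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    rho_sq = min(sxy * sxy / (sxx * syy), 1.0 - 1e-12)  # keep the log finite
    return -0.5 * math.log(1.0 - rho_sq)


def synchrony_map(audio, frames):
    """Score every pixel against the audio track.
    audio[t] is a scalar audio feature for frame t (hypothetical, e.g. energy);
    frames[t][r][c] is the intensity of pixel (r, c) in frame t."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [gaussian_mi(audio, [frame[r][c] for frame in frames]) for c in range(width)]
        for r in range(height)
    ]


def peak_location(mi_map):
    """Return (row, col) of the highest-scoring pixel."""
    _, row, col = max(
        (value, r, c) for r, line in enumerate(mi_map) for c, value in enumerate(line)
    )
    return row, col


def within_square(point, centre, side=100):
    """True if point lies inside a side x side square centred on centre."""
    half = side / 2
    return abs(point[0] - centre[0]) <= half and abs(point[1] - centre[1]) <= half


def make_synthetic_clip(seed=0, n_frames=50, height=8, width=8, source=(5, 2)):
    """Toy data: every pixel is independent noise except `source`,
    which follows the audio feature plus a little noise."""
    rng = random.Random(seed)
    audio = [rng.gauss(0.0, 1.0) for _ in range(n_frames)]
    frames = [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(n_frames)
    ]
    for t in range(n_frames):
        frames[t][source[0]][source[1]] = audio[t] + 0.1 * rng.gauss(0.0, 1.0)
    return audio, frames


if __name__ == "__main__":
    audio, frames = make_synthetic_clip()
    estimate = peak_location(synchrony_map(audio, frames))
    print("estimated source pixel:", estimate)
    print("hit:", within_square(estimate, (5, 2), side=2))
```

On real video the per-pixel loop would normally be vectorised and computed over a sliding window of frames; the plain-Python form here is chosen only to keep the sketch dependency-free.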



Author information

Authors and Affiliations

IBM TJ Watson Research Center, PO Box 218, Yorktown Heights, NY, 10598, USA
Harriet J. Nock, Giridharan Iyengar & Chalapathy Neti


Editor information

Editors and Affiliations

LIACS Media Lab, Leiden University, Niels Bohrweg 1, 2333 CA, Leiden, The Netherlands
Erwin M. Bakker & Michael S. Lew
Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, 405 N. Mathews Avenue, Urbana, IL, 61801, USA
Thomas S. Huang
University of Amsterdam, Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands
Nicu Sebe
Siemens Corporate Research, 755 College Road East, Princeton, NJ, 08540, USA
Xiang Sean Zhou


Cite this paper

Nock, H.J., Iyengar, G., Neti, C. (2003). Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study. In: Bakker, E.M., Lew, M.S., Huang, T.S., Sebe, N., Zhou, X.S. (eds) Image and Video Retrieval. CIVR 2003. Lecture Notes in Computer Science, vol 2728. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45113-7_48


DOI: https://doi.org/10.1007/3-540-45113-7_48
Published: 24 June 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40634-1
Online ISBN: 978-3-540-45113-6
eBook Packages: Springer Book Archive
