Abstract
This paper reviews definitions of audio-visual synchrony and examines their empirical behaviour on test sets up to 200 times larger than those used by other authors. The results give new insights into the practical utility of existing synchrony definitions and justify the application of audio-visual synchrony techniques to the problem of active speaker localisation in broadcast video. Performance is evaluated using a test set of twelve clips of alternating speakers from the multiple-speaker CUAVE corpus. Accuracy of 76% is obtained for the task of identifying the active member of a speaker pair at different points in time, comparable to the performance of two purely video image-based schemes. Accuracy of 65% is obtained on the more challenging task of locating a point within a 100×100 pixel square centred on the active speaker’s mouth with no prior face detection; the performance upper bound if perfect face detection were available is 69%. This result is significantly better than that of two purely video image-based schemes.
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nock, H.J., Iyengar, G., Neti, C. (2003). Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study. In: Bakker, E.M., Lew, M.S., Huang, T.S., Sebe, N., Zhou, X.S. (eds) Image and Video Retrieval. CIVR 2003. Lecture Notes in Computer Science, vol 2728. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45113-7_48
DOI: https://doi.org/10.1007/3-540-45113-7_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40634-1
Online ISBN: 978-3-540-45113-6
eBook Packages: Springer Book Archive