Audio-Visual Clustering for 3D Speaker Localization

Khalidov, Vasil; Forbes, Florence; Hansard, Miles; Arnaud, Elise; Horaud, Radu

doi:10.1007/978-3-540-85853-9_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5237)

Included in the following conference series:

International Workshop on Machine Learning for Multimodal Interaction

Abstract

We address the issue of localizing individuals in a scene that contains several people engaged in a multiple-speaker conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation Maximization algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker.
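The clustering idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' model: it assumes that both visual and auditory observations have already been mapped to 3D points, whereas the paper relates the sensor data to the scene space through its own generative model. The function names (`responsibilities`, `weighted_var`, `fit_shared_means`), the isotropic variances, the hand-picked initial means, and the synthetic data are all assumptions made for illustration. Two Gaussian mixtures, one per modality, share the same 3D means, and an EM-style loop (with conditional maximization steps) estimates the speaker positions together with a soft count of audio observations per speaker, used here as a rough proxy for speaking activity.

```python
import numpy as np

def responsibilities(points, means, var, weights):
    """E-step: posterior probability that each point belongs to each speaker."""
    sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # The Gaussian normalizer is identical for all components, so it cancels.
    log_post = np.log(weights + 1e-12) - 0.5 * sq_dist / var
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def weighted_var(points, means, resp):
    """Isotropic variance estimate, weighted by the responsibilities."""
    sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return (resp * sq_dist).sum() / (points.shape[1] * len(points)) + 1e-6

def fit_shared_means(visual, audio, init_means, n_iter=50):
    """Toy EM: two isotropic mixtures (visual, audio) tied to the same 3D means."""
    means = np.asarray(init_means, dtype=float).copy()
    k = len(means)
    var_v = var_a = 1.0
    w_v = np.full(k, 1.0 / k)
    w_a = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        r_v = responsibilities(visual, means, var_v, w_v)
        r_a = responsibilities(audio, means, var_a, w_a)
        # Conditional M-step 1: shared means, weighted by each modality's precision.
        num = r_v.T @ visual / var_v + r_a.T @ audio / var_a
        den = r_v.sum(axis=0) / var_v + r_a.sum(axis=0) / var_a + 1e-12
        means = num / den[:, None]
        # Conditional M-step 2: per-modality variance and mixing weights.
        var_v = weighted_var(visual, means, r_v)
        var_a = weighted_var(audio, means, r_a)
        w_v = r_v.mean(axis=0)
        w_a = r_a.mean(axis=0)
    # Soft count of audio observations per speaker (proxy for speaking activity).
    activity = responsibilities(audio, means, var_a, w_a).sum(axis=0)
    return means, activity

# Synthetic scene (hypothetical values): two people, only the first one speaks.
rng = np.random.default_rng(0)
true_positions = np.array([[-1.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
visual = np.vstack([p + 0.05 * rng.standard_normal((100, 3)) for p in true_positions])
audio = true_positions[0] + 0.2 * rng.standard_normal((30, 3))

means, activity = fit_shared_means(
    visual, audio, init_means=[[-0.5, 0.0, 2.0], [0.5, 0.0, 2.0]]
)
print("estimated positions:\n", means)
print("soft audio counts per speaker:", activity)
```

In this toy setting the two modalities cooperate through the shared means: the numerous, precise visual points pin down each position, while the audio points indicate which cluster is acoustically active. A real system would also need the mappings from 3D positions to binocular and binaural observations, which this sketch leaves out.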

Author information

Authors and Affiliations

INRIA Grenoble Rhône-Alpes, 655 avenue de l’Europe, 38334, Montbonnot, France
Vasil Khalidov, Florence Forbes, Miles Hansard, Elise Arnaud & Radu Horaud
Université Joseph Fourier, BP 53, 38041, Grenoble Cedex 9, France
Elise Arnaud

Editor information

Andrei Popescu-Belis, Rainer Stiefelhagen

Cite this paper

Khalidov, V., Forbes, F., Hansard, M., Arnaud, E., Horaud, R. (2008). Audio-Visual Clustering for 3D Speaker Localization. In: Popescu-Belis, A., Stiefelhagen, R. (eds) Machine Learning for Multimodal Interaction. MLMI 2008. Lecture Notes in Computer Science, vol 5237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85853-9_8

DOI: https://doi.org/10.1007/978-3-540-85853-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85852-2
Online ISBN: 978-3-540-85853-9
eBook Packages: Computer Science, Computer Science (R0)
