Abstract
We address the problem of localizing individuals in a scene in which several people are engaged in a multiple-speaker conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data to a representation of the common 3D scene-space, via a pair of Gaussian mixture models. Inference is performed by a version of the Expectation-Maximization algorithm, which provides cooperative estimates of both the activity (speaking or not) and the 3D position of each speaker.
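The idea of recasting localization as clustering can be illustrated with a minimal sketch. This is not the model of the paper, which ties a pair of Gaussian mixtures (one per modality) to common 3D positions. It assumes instead that observations are already available as 3D points and fits a single isotropic Gaussian mixture with plain EM. The function name `em_gmm`, the synthetic speaker positions, and the noise level are all hypothetical choices made for illustration.

```python
import numpy as np

def em_gmm(points, n_clusters, n_iter=50):
    """Fit an isotropic Gaussian mixture to points (N x D) with plain EM.

    Illustrative only: a generic mixture, not the paper's audio-visual model.
    """
    n, d = points.shape
    # Deterministic farthest-point initialisation of the cluster means.
    means = [points[0]]
    for _ in range(1, n_clusters):
        dist = np.min([((points - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(points[dist.argmax()])
    means = np.array(means, dtype=float)
    variances = np.full(n_clusters, points.var())
    weights = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each point.
        sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_prob = (np.log(weights)
                    - 0.5 * d * np.log(2.0 * np.pi * variances)
                    - 0.5 * sq_dist / variances)
        log_prob -= log_prob.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_prob)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means and per-cluster variances.
        mass = resp.sum(axis=0) + 1e-12
        weights = mass / n
        means = (resp.T @ points) / mass[:, None]
        sq_dist = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq_dist).sum(axis=0) / (d * mass) + 1e-6

    return weights, means, variances, resp

# Synthetic scene (assumed values): two speakers, 100 noisy 3D observations each.
rng = np.random.default_rng(0)
true_positions = np.array([[-1.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
points = np.vstack([p + 0.1 * rng.standard_normal((100, 3))
                    for p in true_positions])

weights, means, variances, resp = em_gmm(points, n_clusters=2)
labels = resp.argmax(axis=1)  # hard cluster assignment per observation
```

In this simplified setting each estimated mean plays the role of a speaker position, and the responsibilities in `resp` give a soft assignment of every observation to a speaker. The speaking-activity estimate described in the abstract is not modelled here.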
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Khalidov, V., Forbes, F., Hansard, M., Arnaud, E., Horaud, R. (2008). Audio-Visual Clustering for 3D Speaker Localization. In: Popescu-Belis, A., Stiefelhagen, R. (eds) Machine Learning for Multimodal Interaction. MLMI 2008. Lecture Notes in Computer Science, vol 5237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85853-9_8
DOI: https://doi.org/10.1007/978-3-540-85853-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85852-2
Online ISBN: 978-3-540-85853-9
eBook Packages: Computer Science (R0)