Abstract
This paper proposes an approach for Voice Activity Detection (VAD) based on the automatic measurement of gesturing. The main motivation for the work is that gestures have been shown to be tightly correlated with speech, so they can be considered reliable evidence that a person is talking. Using gestures rather than speech for VAD can be helpful in many situations (e.g., surveillance and monitoring in public spaces) where speech cannot be recorded for technical, legal, or ethical reasons. The results show that the gesturing measurement approach proposed in this work achieves, on a frame-by-frame basis, an accuracy of 71 percent in distinguishing between speech and non-speech.
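To make the general idea concrete, the sketch below shows one naive way a frame-by-frame, gesture-based speech/non-speech decision could be implemented: a per-frame motion-energy measure is used as a proxy for gesturing and thresholded into speech (1) or non-speech (0) labels. This is only an illustrative assumption about the pipeline; the function names, the motion-energy proxy, the smoothing window, and the threshold are all hypothetical and are not the measurement or classifier actually used in the paper.

```python
import numpy as np

def frame_motion_energy(frames):
    """Hypothetical gesturing proxy: mean absolute pixel difference
    between consecutive grayscale frames (shape: T x H x W)."""
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.abs(np.diff(frames, axis=0))
    energy = diffs.mean(axis=(1, 2))
    # Pad so the output has one value per input frame.
    return np.concatenate([[0.0], energy])

def gesture_vad(frames, threshold=None):
    """Label each frame as speech (1) or non-speech (0) by thresholding
    smoothed motion energy. Threshold choice is illustrative only."""
    energy = frame_motion_energy(frames)
    # Short moving-average smoothing to reduce frame-level noise.
    kernel = np.ones(5) / 5.0
    smoothed = np.convolve(energy, kernel, mode="same")
    if threshold is None:
        threshold = smoothed.mean()  # naive data-driven default
    return (smoothed > threshold).astype(int)
```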
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cristani, M., Pesarin, A., Vinciarelli, A., Crocco, M., Murino, V. (2012). Look at Who’s Talking: Voice Activity Detection by Automated Gesture Analysis. In: Wichert, R., Van Laerhoven, K., Gelissen, J. (eds) Constructing Ambient Intelligence. AmI 2011. Communications in Computer and Information Science, vol 277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31479-7_14
DOI: https://doi.org/10.1007/978-3-642-31479-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31478-0
Online ISBN: 978-3-642-31479-7
eBook Packages: Computer Science (R0)