Abstract
This paper proposes an approach for Voice Activity Detection (VAD) based on the automatic measurement of gesturing. The main motivation for the work is that gestures have been shown to be tightly correlated with speech, so they can be considered reliable evidence that a person is talking. Using gestures rather than speech for VAD can be helpful in many situations (e.g., surveillance and monitoring in public spaces) where speech cannot be recorded for technical, legal, or ethical reasons. The results show that the gesturing measurement approach proposed in this work achieves, on a frame-by-frame basis, an accuracy of 71 percent in distinguishing between speech and non-speech.
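To make the general idea concrete, the sketch below shows one naive way a frame-by-frame, gesture-based speech/non-speech decision could be implemented: a per-frame motion-energy measure is used as a proxy for gesturing and thresholded into speech (1) or non-speech (0) labels. This is only an illustrative assumption about the pipeline; the function names, the motion-energy proxy, the smoothing window, and the threshold are all hypothetical and are not the measurement or classifier actually used in the paper.

```python
import numpy as np

def frame_motion_energy(frames):
    """Hypothetical gesturing proxy: mean absolute pixel difference
    between consecutive grayscale frames (shape: T x H x W)."""
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.abs(np.diff(frames, axis=0))
    energy = diffs.mean(axis=(1, 2))
    # Pad so the output has one value per input frame.
    return np.concatenate([[0.0], energy])

def gesture_vad(frames, threshold=None):
    """Label each frame as speech (1) or non-speech (0) by thresholding
    smoothed motion energy. Threshold choice is illustrative only."""
    energy = frame_motion_energy(frames)
    # Short moving-average smoothing to reduce frame-level noise.
    kernel = np.ones(5) / 5.0
    smoothed = np.convolve(energy, kernel, mode="same")
    if threshold is None:
        threshold = smoothed.mean()  # naive data-driven default
    return (smoothed > threshold).astype(int)
```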
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cristani, M., Pesarin, A., Vinciarelli, A., Crocco, M., Murino, V. (2012). Look at Who’s Talking: Voice Activity Detection by Automated Gesture Analysis. In: Wichert, R., Van Laerhoven, K., Gelissen, J. (eds) Constructing Ambient Intelligence. AmI 2011. Communications in Computer and Information Science, vol 277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31479-7_14
DOI: https://doi.org/10.1007/978-3-642-31479-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31478-0
Online ISBN: 978-3-642-31479-7
eBook Packages: Computer Science (R0)