
Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment

Chapter in Multidisciplinary Approaches to Neural Computing

Abstract

This paper focuses on employing Convolutional Neural Networks (CNN) with 3-D kernels as Voice Activity Detectors in multi-room domestic scenarios (mVAD). This approach is compared with the Multi-Layer Perceptron (MLP), and notable advancements are observed with respect to the authors' previous works. To approximate real-life scenarios, the DIRHA dataset is used; it was recorded in a home environment by means of several microphones arranged in various rooms. The study consists of a multi-stage analysis focusing on the selection of the network size and of the input microphones, in terms of their number and position. Results are evaluated in terms of Speech Activity Detection error rate (SAD). The CNN-mVAD outperforms the other method with consistently solid performance statistics, achieving a SAD of 7.0% in the best overall case.
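
For illustration, the sketch below shows one hypothetical way to realise a CNN-mVAD with 3-D kernels in Keras (see Note 2): a (time, frequency, microphone) feature volume is convolved with kernels that span all three axes, and one sigmoid output unit per room gives an independent speech/non-speech decision. All feature dimensions, layer sizes and the number of monitored rooms are assumptions made for this example, not the configuration evaluated in the paper.

```python
# Hypothetical sketch of a CNN-mVAD with 3-D kernels, written with Keras
# (http://keras.io/). Dimensions and hyper-parameters are illustrative
# assumptions, not the settings reported in the chapter.
import numpy as np
from tensorflow.keras import layers, models

N_FRAMES = 25   # temporal context in frames (assumed)
N_BANDS = 40    # log-Mel bands per frame (assumed)
N_MICS = 6      # microphones feeding the network (assumed)
N_ROOMS = 2     # rooms whose speech activity is predicted (assumed)

def build_cnn_mvad():
    """Map a (time, frequency, microphone) feature volume to
    per-room speech/non-speech probabilities."""
    inp = layers.Input(shape=(N_FRAMES, N_BANDS, N_MICS, 1))
    # 3-D kernels span time, frequency and the microphone axis jointly.
    x = layers.Conv3D(16, kernel_size=(5, 5, 3), activation='relu')(inp)
    x = layers.MaxPooling3D(pool_size=(2, 2, 1))(x)
    x = layers.Conv3D(32, kernel_size=(3, 3, 2), activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)
    # One sigmoid unit per room: independent voice-activity decisions.
    out = layers.Dense(N_ROOMS, activation='sigmoid')(x)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

if __name__ == '__main__':
    model = build_cnn_mvad()
    # A random feature volume stands in for a real multi-room excerpt.
    dummy = np.random.rand(1, N_FRAMES, N_BANDS, N_MICS, 1).astype('float32')
    print(model.predict(dummy))  # per-room speech probabilities
```

In this sketch, treating the microphone index as a third axis is what lets the 3-D kernels learn inter-channel patterns jointly with spectro-temporal ones, whereas 2-D kernels would require the channels to be stacked or processed separately.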

Notes

  1. http://dirha.fbk.eu/simcorpora.
  2. http://keras.io/.

Author information

Corresponding author: Paolo Vecchiotti.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Vecchiotti, P., Vesperini, F., Principi, E., Squartini, S., Piazza, F. (2018). Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment. In: Esposito, A., Faundez-Zanuy, M., Morabito, F., Pasero, E. (eds) Multidisciplinary Approaches to Neural Computing. Smart Innovation, Systems and Technologies, vol 69. Springer, Cham. https://doi.org/10.1007/978-3-319-56904-8_16

  • DOI: https://doi.org/10.1007/978-3-319-56904-8_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56903-1

  • Online ISBN: 978-3-319-56904-8

  • eBook Packages: Engineering, Engineering (R0)
