Abstract
This paper investigates Convolutional Neural Networks (CNNs) with 3-D kernels for voice activity detection in multi-room domestic scenarios (mVAD). The approach is compared with the Multi-Layer Perceptron (MLP), and improvements are observed with respect to the authors' previous work. To approximate real-life conditions, the study uses the DIRHA dataset, which was recorded in a home environment with several microphones arranged in various rooms. The study consists of a multi-stage analysis covering the selection of the network size and of the input microphones, in terms of both their number and their position. Results are evaluated in terms of the Speech Activity Detection (SAD) error rate. The CNN-mVAD outperforms the MLP-based detector, and does so robustly across the performance statistics, achieving a SAD of 7.0% in the best overall case.
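As a rough illustration of what a 3-D kernel means in this setting, the sketch below slides a single kernel over a feature tensor whose axes are microphone, time frame and frequency band. It is a minimal sketch under stated assumptions, not the network described in the chapter: the tensor sizes, the random features, the averaging kernel and the helper name conv3d_valid are all hypothetical, and a real CNN layer would learn many such kernels and add a bias and a nonlinearity.

```python
import numpy as np


def conv3d_valid(x, kernel):
    """Naive 'valid' 3-D cross-correlation, the operation a CNN layer applies.

    x      : array of shape (mics, frames, bands)
    kernel : array of shape (km, kt, kf), no larger than x on any axis
    """
    n_mics, n_frames, n_bands = x.shape
    km, kt, kf = kernel.shape
    out = np.zeros((n_mics - km + 1, n_frames - kt + 1, n_bands - kf + 1))
    for m in range(out.shape[0]):
        for t in range(out.shape[1]):
            for f in range(out.shape[2]):
                # Each output value mixes neighbouring microphones,
                # frames and frequency bands at once.
                patch = x[m:m + km, t:t + kt, f:f + kf]
                out[m, t, f] = np.sum(patch * kernel)
    return out


# Hypothetical sizes: 3 microphones, 10 time frames, 8 frequency bands.
rng = np.random.default_rng(0)
features = rng.standard_normal((3, 10, 8))

# Hypothetical 3x3x3 averaging kernel; a trained network would learn these weights.
kernel = np.ones((3, 3, 3)) / 27.0

feature_map = conv3d_valid(features, kernel)
print(feature_map.shape)
```

Because the kernel spans the microphone axis as well as time and frequency, a single output value combines evidence from several channels, which is the property that distinguishes a 3-D kernel from a 2-D one applied to each microphone separately.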
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Vecchiotti, P., Vesperini, F., Principi, E., Squartini, S., Piazza, F. (2018). Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment. In: Esposito, A., Faundez-Zanuy, M., Morabito, F., Pasero, E. (eds) Multidisciplinary Approaches to Neural Computing. Smart Innovation, Systems and Technologies, vol 69. Springer, Cham. https://doi.org/10.1007/978-3-319-56904-8_16
DOI: https://doi.org/10.1007/978-3-319-56904-8_16
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56903-1
Online ISBN: 978-3-319-56904-8
eBook Packages: Engineering, Engineering (R0)