
Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment

Chapter in Multidisciplinary Approaches to Neural Computing

Abstract

This paper focuses on employing Convolutional Neural Networks (CNN) with 3-D kernels as Voice Activity Detectors in multi-room domestic scenarios (mVAD). This approach is compared with the Multi-Layer Perceptron (MLP), and notable advancements are observed with respect to the authors' previous works. To approximate real-life scenarios, the DIRHA dataset is used; it was recorded in a home environment by means of several microphones arranged in various rooms. The study consists of a multi-stage analysis focusing on the selection of the network size and of the input microphones, in terms of their number and position. Results are evaluated in terms of Speech Activity Detection error rate (SAD). The CNN-mVAD outperforms the other method with consistently solid performance statistics, achieving a SAD of 7.0% in the best overall case.
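
For illustration, the sketch below shows one hypothetical way to realise a CNN-mVAD with 3-D kernels in Keras (see Note 2): a (time, frequency, microphone) feature volume is convolved with kernels that span all three axes, and one sigmoid output unit per room gives an independent speech/non-speech decision. All feature dimensions, layer sizes and the number of monitored rooms are assumptions made for this example, not the configuration evaluated in the paper.

```python
# Hypothetical sketch of a CNN-mVAD with 3-D kernels, written with Keras
# (http://keras.io/). Dimensions and hyper-parameters are illustrative
# assumptions, not the settings reported in the chapter.
import numpy as np
from tensorflow.keras import layers, models

N_FRAMES = 25   # temporal context in frames (assumed)
N_BANDS = 40    # log-Mel bands per frame (assumed)
N_MICS = 6      # microphones feeding the network (assumed)
N_ROOMS = 2     # rooms whose speech activity is predicted (assumed)

def build_cnn_mvad():
    """Map a (time, frequency, microphone) feature volume to
    per-room speech/non-speech probabilities."""
    inp = layers.Input(shape=(N_FRAMES, N_BANDS, N_MICS, 1))
    # 3-D kernels span time, frequency and the microphone axis jointly.
    x = layers.Conv3D(16, kernel_size=(5, 5, 3), activation='relu')(inp)
    x = layers.MaxPooling3D(pool_size=(2, 2, 1))(x)
    x = layers.Conv3D(32, kernel_size=(3, 3, 2), activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)
    # One sigmoid unit per room: independent voice-activity decisions.
    out = layers.Dense(N_ROOMS, activation='sigmoid')(x)
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

if __name__ == '__main__':
    model = build_cnn_mvad()
    # A random feature volume stands in for a real multi-room excerpt.
    dummy = np.random.rand(1, N_FRAMES, N_BANDS, N_MICS, 1).astype('float32')
    print(model.predict(dummy))  # per-room speech probabilities
```

In this sketch, treating the microphone index as a third axis is what lets the 3-D kernels learn inter-channel patterns jointly with spectro-temporal ones, whereas 2-D kernels would require the channels to be stacked or processed separately.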

Notes

  1. http://dirha.fbk.eu/simcorpora.
  2. http://keras.io/.

Author information

Corresponding author: Paolo Vecchiotti.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Vecchiotti, P., Vesperini, F., Principi, E., Squartini, S., Piazza, F. (2018). Convolutional Neural Networks with 3-D Kernels for Voice Activity Detection in a Multiroom Environment. In: Esposito, A., Faundez-Zanuy, M., Morabito, F., Pasero, E. (eds) Multidisciplinary Approaches to Neural Computing. Smart Innovation, Systems and Technologies, vol 69. Springer, Cham. https://doi.org/10.1007/978-3-319-56904-8_16

  • DOI: https://doi.org/10.1007/978-3-319-56904-8_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56903-1

  • Online ISBN: 978-3-319-56904-8

  • eBook Packages: Engineering, Engineering (R0)
