Robust Voice Activity Detection Based on Concept of Modulation Transfer Function in Noisy Reverberant Environments

Journal of Signal Processing Systems

Abstract

Voice activity detection (VAD) identifies speech and non-speech periods in observed speech signals and is an important front-end technique for many speech technology applications. Many VAD methods have been proposed, but most have been applied under clean or noisy conditions. Only a few have been proposed for reverberant conditions, and in particular for noisy reverberant conditions. Designing an accurate and robust VAD method for noisy reverberant conditions therefore requires an understanding of how noise and reverberation degrade speech. These degradations can be described by the modulation transfer function (MTF) under noisy and reverberant conditions. This study therefore uses the MTF concept to reduce the effects of noise and reverberation on speech and proposes a robust VAD method. In this method, noise reduction and dereverberation are first applied to the temporal power envelope of the speech signal to restore it. Power thresholding, designed on the restored temporal power envelope, then makes the VAD decision. A method of estimating the signal-to-noise ratio (SNR) is also proposed so that the SNR can be estimated accurately in the noise reduction stage. Experiments under both artificial and realistic noisy reverberant conditions evaluated the proposed VAD method and compared it with conventional VAD methods. The results showed that the proposed method significantly outperformed the conventional methods under both artificial and realistic noisy reverberant conditions.
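
The final stage described in the abstract, a power-threshold decision on a temporal power envelope, can be illustrated with a minimal sketch. This is not the authors' implementation: the MTF-based noise reduction and dereverberation stages are omitted, the envelope is approximated by frame-wise mean power, and the function names, the 25 ms frame length, and the 20 dB threshold below the peak are all assumptions chosen for illustration.

```python
import math


def power_envelope(signal, frame_len):
    """Frame-wise mean power: a crude stand-in for a temporal power envelope.

    Assumption: the paper restores the envelope with MTF-based processing;
    here no restoration is performed.
    """
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def power_threshold_vad(envelope, drop_db=20.0):
    """Label a frame as speech if its power is within drop_db of the peak.

    Assumption: drop_db is an illustrative value, not one taken from the paper.
    """
    peak = max(envelope)
    if peak <= 0.0:
        return [False] * len(envelope)
    threshold = peak * 10.0 ** (-drop_db / 10.0)
    return [p >= threshold for p in envelope]


def tone(amplitude, n_samples, freq_hz=200.0, fs=8000):
    """Synthetic sine segment used as toy input."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * k / fs)
            for k in range(n_samples)]


# Toy signal: quiet segment, loud segment, quiet segment (0.5 s each at 8 kHz).
frame_len = 200  # 25 ms at 8 kHz
signal = tone(0.01, 4000) + tone(1.0, 4000) + tone(0.01, 4000)

envelope = power_envelope(signal, frame_len)
decisions = power_threshold_vad(envelope)
```

In this toy case the 20 loud frames are labeled as active and the 40 quiet frames as inactive. A fixed threshold relative to the peak is the simplest possible choice; under real noise and reverberation the envelope would first need the restoration steps the abstract describes.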

Acknowledgements

This work was supported by an A3 foresight program made available by the Japan Society for the Promotion of Science and by the Strategic Information and Communications R&D Promotion Programme (SCOPE: 131205001) of the Ministry of Internal Affairs and Communications (MIC), Japan. This study was also supported by a Grant-in-Aid for Scientific Research (A) (No. 25240026) and by the Secom Science and Technology Foundation.

Author information

Corresponding author

Correspondence to Shota Morita.

About this article

Cite this article

Morita, S., Unoki, M., Lu, X. et al. Robust Voice Activity Detection Based on Concept of Modulation Transfer Function in Noisy Reverberant Environments. J Sign Process Syst 82, 163–173 (2016). https://doi.org/10.1007/s11265-015-1014-4

  • DOI: https://doi.org/10.1007/s11265-015-1014-4
