Abstract
Voice activity detection (VAD) identifies the speech and non-speech periods in an observed speech signal, and it is an important front-end technique for many speech technology applications. Many VAD methods have been proposed, but most have been applied only under clean or noisy conditions. Few methods have been proposed for reverberant conditions, and fewer still for noisy reverberant conditions. Designing an accurate and robust VAD method for noisy reverberant conditions therefore requires an understanding of how noise and reverberation degrade speech. These degradations can be characterized by the modulation transfer function (MTF) of the noisy reverberant environment. Our study therefore uses the MTF concept to reduce the effects of noise and reverberation on speech, and proposes a robust VAD method based on it. In this method, noise reduction and dereverberation are first applied to the temporal power envelope of the speech signal to restore it. Power thresholding on the restored temporal power envelope then serves as the VAD decision. A method of estimating the signal-to-noise ratio (SNR) is also proposed so that the SNR can be estimated accurately in the noise reduction stage. Experiments under both artificial and realistic noisy reverberant conditions were carried out to evaluate the proposed VAD method and to compare it with conventional VAD methods. The results revealed that the proposed method significantly outperformed the conventional methods under both artificial and realistic noisy reverberant conditions.
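As a rough illustration of the pipeline outlined in the abstract, the following Python sketch computes a temporal power envelope, removes an estimated noise floor, and applies power thresholding. It is a minimal sketch under stated assumptions, not the method evaluated in the article: the function names, the moving-average smoother, the noise-floor estimate taken from a leading noise-only segment, and the fixed threshold relative to the envelope peak are all hypothetical choices, and the dereverberation stage is omitted. The `mtf` function uses the classical Houtgast and Steeneken form for exponentially decaying reverberation and stationary additive noise, which may differ from the formulation used in the article.

```python
import numpy as np


def mtf(fm, tr, snr_db):
    """Classical MTF at modulation frequency fm (Hz) for reverberation
    time tr (s) and signal-to-noise ratio snr_db (dB). Assumed form."""
    reverb = 1.0 / np.sqrt(1.0 + (2.0 * np.pi * fm * tr / 13.8) ** 2)
    noise = 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
    return reverb * noise


def power_envelope(x, fs, win_s=0.02):
    """Temporal power envelope: squared signal smoothed by a moving average
    (an illustrative stand-in for a low-pass filtered envelope)."""
    n = max(1, int(round(win_s * fs)))
    kernel = np.ones(n) / n
    return np.convolve(x ** 2, kernel, mode="same")


def subtract_noise_floor(env, noise_power):
    """Crude noise reduction on the envelope: remove a constant noise power."""
    return np.maximum(env - noise_power, 0.0)


def power_threshold_vad(env, threshold_db=-10.0):
    """Label samples as speech when the envelope lies within threshold_db
    of its peak. The threshold value is a hypothetical choice."""
    peak = env.max()
    if peak <= 0.0:
        return np.zeros(env.shape, dtype=bool)
    return env >= peak * 10.0 ** (threshold_db / 10.0)


# Synthetic test signal: weak noise with a tone burst from 0.4 s to 0.6 s.
fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
x = 0.05 * rng.standard_normal(fs)
burst = (t >= 0.4) & (t < 0.6)
x[burst] += np.sin(2.0 * np.pi * 200.0 * t[burst])

env = power_envelope(x, fs)
noise_power = env[: int(0.2 * fs)].mean()  # assumes a noise-only lead-in
restored = subtract_noise_floor(env, noise_power)
vad = power_threshold_vad(restored)
```

Here `vad` is a boolean array with one decision per sample. In this toy case the noise power is read from a segment known to contain no speech; the article instead proposes an SNR estimation method for the noise reduction stage, which this sketch does not reproduce.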

Acknowledgements
This work was supported by an A3 foresight program made available by the Japan Society for the Promotion of Science and by the Strategic Information and Communications R&D Promotion Programme (SCOPE: 131205001) of the Ministry of Internal Affairs and Communications (MIC), Japan. This study was also supported by a Grant-in-Aid for Scientific Research (A) (No. 25240026) and by the Secom Science and Technology Foundation.
Cite this article
Morita, S., Unoki, M., Lu, X. et al. Robust Voice Activity Detection Based on Concept of Modulation Transfer Function in Noisy Reverberant Environments. J Sign Process Syst 82, 163–173 (2016). https://doi.org/10.1007/s11265-015-1014-4