Robust Voice Activity Detection Based on Concept of Modulation Transfer Function in Noisy Reverberant Environments

Morita, Shota; Unoki, Masashi; Lu, Xugang; Akagi, Masato

doi:10.1007/s11265-015-1014-4

Published: 11 June 2015

Journal of Signal Processing Systems, volume 82, pages 163–173 (2016)

Shota Morita¹, Masashi Unoki¹, Xugang Lu² & Masato Akagi¹

Abstract

Voice activity detection (VAD) separates speech periods from non-speech periods in an observed speech signal, and it is an important front-end technique for many speech technology applications. Many VAD methods have been proposed, but most of them have been applied only under clean or noisy conditions. Only a few have been proposed for reverberant conditions, and for noisy reverberant conditions in particular. Designing an accurate and robust VAD method for noisy reverberant conditions therefore requires an understanding of how noise and reverberation degrade speech. These ill effects can be described by the modulation transfer function (MTF) of the noisy reverberant environment. Our study uses the MTF concept to reduce the ill effects of noise and reverberation on speech, and we propose a robust VAD method based on it. The method first applies noise reduction and dereverberation to the temporal power envelope of the speech signal in order to restore that envelope. The VAD decision is then made by power thresholding on the restored temporal power envelope. To make the noise reduction stage accurate, we also propose a method of estimating the signal-to-noise ratio (SNR). We evaluated the proposed VAD method in experiments under both artificial and realistic noisy reverberant conditions and compared it with conventional VAD methods. The results revealed that the proposed method significantly outperformed the conventional methods under both artificial and realistic noisy reverberant conditions.
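
The processing chain outlined in the abstract (restore the temporal power envelope, then threshold it) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the classical MTF model of an exponentially decaying room response plus stationary noise, it assumes the noise power and reverberation time are already known (the SNR estimator proposed in the paper is not reproduced), and it uses a fixed relative threshold as a stand-in for the paper's decision rule. All function and parameter names (`mtf`, `frame_power_envelope`, `restore_envelope`, `vad_decision`, `env_rate`, `rel_threshold`) are hypothetical.

```python
import numpy as np

def mtf(fm, t60, snr_db):
    """MTF at modulation frequency fm (Hz), assuming exponential reverberant
    decay with reverberation time t60 (s) and stationary noise at snr_db (dB)."""
    reverb = 1.0 / np.sqrt(1.0 + (2.0 * np.pi * fm * t60 / 13.8) ** 2)
    noise = 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
    return reverb * noise

def frame_power_envelope(x, fs, env_rate):
    """Mean power in non-overlapping frames: an envelope sampled at env_rate Hz."""
    hop = int(fs // env_rate)
    n = len(x) // hop
    frames = np.asarray(x[: n * hop], dtype=float).reshape(n, hop)
    return (frames ** 2).mean(axis=1)

def restore_envelope(env, env_rate, t60, noise_power):
    """Noise reduction followed by dereverberation on a power envelope."""
    # Noise reduction (assumption: noise power is known): subtract and floor at 0.
    denoised = np.maximum(np.asarray(env, dtype=float) - noise_power, 0.0)
    # Dereverberation (assumption: the room smears the envelope with
    # exp(-13.8 t / t60)); the inverse of that smearing is a first-order filter.
    decay = np.exp(-13.8 / (t60 * env_rate))
    restored = denoised.copy()
    restored[1:] -= decay * denoised[:-1]
    return np.maximum(restored, 0.0)

def vad_decision(restored, rel_threshold=0.1):
    """Power thresholding: flag frames above a fraction of the peak power."""
    return restored > rel_threshold * restored.max()

# Synthetic check in the envelope domain (all values are illustrative).
env_rate, t60, noise_power = 40.0, 0.5, 0.05
clean = np.zeros(120)
clean[40:80] = 1.0                                   # speech occupies frames 40..79
kernel = np.exp(-13.8 * np.arange(120) / (t60 * env_rate))
observed = np.convolve(clean, kernel)[:120] + noise_power

restored = restore_envelope(observed, env_rate, t60, noise_power)
speech = vad_decision(restored)                      # True only for frames 40..79
```

Thresholding the observed envelope directly would extend the detected speech region into the reverberant tail after frame 79, which is why the restoration step comes before the decision in this sketch.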

Acknowledgements

This work was supported by an A3 Foresight Program made available by the Japan Society for the Promotion of Science and by the Strategic Information and Communications R&D Promotion Programme (SCOPE: 131205001) of the Ministry of Internal Affairs and Communications (MIC), Japan. This study was also supported by a Grant-in-Aid for Scientific Research (A) (No. 25240026) and by the Secom Science and Technology Foundation.

Author information

Authors and Affiliations

School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan
Shota Morita, Masashi Unoki & Masato Akagi
Universal Communication Research Institute, National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan
Xugang Lu

Corresponding author

Correspondence to Shota Morita.

About this article

Cite this article

Morita, S., Unoki, M., Lu, X. et al. Robust Voice Activity Detection Based on Concept of Modulation Transfer Function in Noisy Reverberant Environments. J Sign Process Syst 82, 163–173 (2016). https://doi.org/10.1007/s11265-015-1014-4

Received: 15 November 2014
Revised: 30 April 2015
Accepted: 07 May 2015
Published: 11 June 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11265-015-1014-4
