A novel voice activity detection algorithm using modified global thresholding

International Journal of Speech Technology

Abstract

Voice activity detection (VAD) remains a challenging task in real-time applications such as speech coding and speech recognition, because a low signal-to-noise ratio (SNR) degrades the structural properties of the speech signal. VAD locates the speech regions in a signal corrupted by various nonstationary noises. The VAD literature shows that many methods classify in an unbalanced way, with high speech detection rates but poor non-speech detection rates, so that the majority of noisy segments are categorized as speech. To overcome this issue, we propose a novel modified global thresholding scheme built on fuzzy entropy. The proposed scheme identifies both regions by locating the transitions from non-speech to speech and vice versa. This improves the detection rates because the misclassification of noisy segments as speech segments is minimized. The performance of the proposed algorithm is tested on various additive nonstationary noises at different SNR levels. Most existing research assumes that the noise is stationary over a particular interval in order to estimate the noise information, but this assumption is unrealistic in real scenarios. Our main contribution is an algorithm that handles signals containing nonstationary noises and complex events that can be mixtures of different noises. Because the characteristics of speech themselves vary over time (speech is nonstationary), detection becomes more challenging when speech is additively mixed with nonstationary noises, especially at low SNR levels (−5 dB, −10 dB), and the problem approaches the complexity of a real-time scenario. At low SNR levels, the proposed method achieves speech and non-speech detection rates of 91.98% and 87.38%, respectively. It also obtains an accuracy of 93.39% for speech babble noise, whereas the state-of-the-art algorithms range between 50% and 80%. Similarly, the rate of noise detected as speech (NDS) for the proposed algorithm is less than 10%, whereas the benchmark algorithms detect at least 50% of the noise as speech segments. The significance of our method lies in precisely locating where speech begins and ends in a given noisy speech signal. We believe this is a path-breaking approach that can be helpful in real-time speech processing applications.
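
The quantities reported above (speech detection rate, non-speech detection rate, accuracy, and NDS) can be illustrated with a minimal frame-level sketch. It is not the authors' evaluation code. It assumes binary per-frame labels (1 = speech, 0 = non-speech) and reads NDS as the fraction of non-speech frames that the detector labels as speech, which is an interpretation of the abstract. The function name `vad_rates` and the example labels are hypothetical.

```python
import numpy as np


def vad_rates(reference, predicted):
    """Frame-level VAD rates from binary labels (1 = speech, 0 = non-speech).

    Assumption: NDS is taken to be the share of reference non-speech
    frames that the detector marks as speech.
    """
    ref = np.asarray(reference, dtype=bool)
    hyp = np.asarray(predicted, dtype=bool)
    if ref.shape != hyp.shape:
        raise ValueError("reference and predicted must have the same shape")

    n_speech = ref.sum()
    n_noise = (~ref).sum()
    if n_speech == 0 or n_noise == 0:
        raise ValueError("reference needs both speech and non-speech frames")

    return {
        # Speech frames correctly detected as speech.
        "speech_rate": float((ref & hyp).sum() / n_speech),
        # Non-speech frames correctly detected as non-speech.
        "nonspeech_rate": float((~ref & ~hyp).sum() / n_noise),
        # Non-speech frames wrongly detected as speech (assumed NDS).
        "nds": float((~ref & hyp).sum() / n_noise),
        # All frames labeled correctly.
        "accuracy": float((ref == hyp).mean()),
    }


# Hypothetical ten-frame example: one noise frame is flagged as speech
# and one speech frame is missed.
reference = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 0, 0, 1, 1, 1, 0, 0, 0]
rates = vad_rates(reference, predicted)
```

Under this reading, the non-speech detection rate and NDS always sum to one, so a detector that labels most noisy frames as speech shows a high speech rate together with a high NDS, which is the unbalanced behaviour the abstract describes.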

Author information

Corresponding author

Correspondence to R. Johny Elton.

About this article

Cite this article

Elton, R.J., Mohanalin, J. & Vasuki, P. A novel voice activity detection algorithm using modified global thresholding. Int J Speech Technol 24, 127–142 (2021). https://doi.org/10.1007/s10772-020-09777-w

  • DOI: https://doi.org/10.1007/s10772-020-09777-w
