
An Adaptive Method for Robust Detection of Vowels in Noisy Environment

Circuits, Systems, and Signal Processing

Abstract

Automatic detection of vowels plays a significant role in the analysis and synthesis of speech signals. Detecting vowels within a speech utterance in noisy environments and varied contexts is a challenging task. In this work, a robust technique based on non-local means (NLM) estimation is proposed for detecting vowels in noisy speech signals. In the NLM algorithm, the signal value at each sample point is estimated as the weighted sum of signal values at other sample points within a search neighborhood. Each weight is computed from the squared difference between the signal values of two segments: one segment is kept fixed, while the other is slid over the search neighborhood. For any particular sample point, the sum of these weight values is significantly smaller when the segments under consideration are higher in magnitude. In a given speech signal, vowels are regions of high energy, and this remains true even under noisy conditions. In this work, the sum of weight values (SWV), computed at each time instant, is therefore used as a discriminating feature for detecting vowels in a given speech signal. In the proposed approach, the regions where the SWV exhibits significant transitions and attains lower values for a considerable duration compared to the preceding and succeeding regions are hypothesized as vowels. This hypothesis is statistically validated under both clean and noisy test conditions. For a proper comparison, a three-class statistical classifier (vowel, non-vowel and silence) is developed for detecting vowels in a given speech signal. For this classifier, mel-frequency cepstral coefficients are used as the acoustic feature vectors, while a deep neural network (DNN)-hidden Markov model (HMM) is employed for acoustic modeling. The proposed vowel detection method is observed to outperform the DNN-HMM-based statistical classifier as well as existing signal processing approaches under both clean and noisy test conditions.
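As a rough illustration of the SWV feature described above, the following Python sketch computes, for every sample, the sum of NLM weights over a search neighborhood. It is a minimal sketch under stated assumptions: a Gaussian-type kernel exp(-d²/λ) maps the squared segment difference to a weight, and the function name nlm_swv together with the parameters patch_half, search_half and lam are illustrative, not values taken from the paper.

```python
import numpy as np


def nlm_swv(x, patch_half=5, search_half=50, lam=1.0):
    """Sum of NLM weight values (SWV) per sample (illustrative sketch).

    For each sample i, a reference segment of length (2 * patch_half + 1)
    centred at i is compared with segments slid over the search
    neighborhood of half-width search_half.  The squared difference
    between the two segments is mapped to a weight exp(-d2 / lam) and
    the weights are summed; low SWV over a sustained duration marks a
    high-energy (vowel-like) region.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    pad = patch_half + search_half
    # Pad so every sample has a full segment and search neighborhood.
    xp = np.pad(x, pad, mode="reflect")
    swv = np.zeros(n)
    for i in range(n):
        c = i + pad                                      # centre in padded signal
        ref = xp[c - patch_half:c + patch_half + 1]      # fixed (reference) segment
        total = 0.0
        for j in range(c - search_half, c + search_half + 1):
            if j == c:
                continue                                 # skip the trivial self-match
            cand = xp[j - patch_half:j + patch_half + 1]  # sliding segment
            d2 = np.sum((ref - cand) ** 2)               # squared segment difference
            total += np.exp(-d2 / lam)                   # Gaussian-type weight
        swv[i] = total
    return swv
```

Regions where the resulting SWV trace drops and stays low for a considerable duration relative to the preceding and succeeding regions would then be hypothesized as vowels; the kernel shape and the default window lengths here are illustrative design choices and would need tuning to the sampling rate and noise level.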



Acknowledgements

We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions that have greatly improved the quality of the manuscript.

Author information

Corresponding author

Correspondence to Gayadhar Pradhan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Kumar, A., Pradhan, G. & Shahnawazuddin, S. An Adaptive Method for Robust Detection of Vowels in Noisy Environment. Circuits Syst Signal Process 38, 4180–4201 (2019). https://doi.org/10.1007/s00034-019-01052-x
