
Improvements in the Detection of Vowel Onset and Offset Points in a Speech Sequence

Circuits, Systems, and Signal Processing

Abstract

Detecting vowel regions in a given speech signal has long been a challenging research problem. A number of works have been reported over the years to accurately detect vowel regions and the corresponding vowel onset points (VOPs) and vowel end points (VEPs). The effectiveness of statistical acoustic modeling techniques and of front-end signal processing approaches has been explored in this regard. The work presented in this paper aims at improving the detection of vowel regions as well as of the VOPs and VEPs. A number of statistical modeling approaches developed over the years are employed for this task. To that end, three-class classifiers (vowel, nonvowel and silence) are developed on the TIMIT database using the different acoustic modeling techniques, and their classification performances are studied. Using any particular three-class classifier, a given speech sample is then force-aligned against the trained acoustic model under the constraints of a first-pass transcription to detect the vowel regions. The correctly detected and spurious vowel regions are analyzed in detail to find the impact of semivowel and nasal sound units on the detection of vowel regions as well as on the determination of VOPs and VEPs. In addition, a novel front-end feature extraction technique exploiting the temporal and spectral characteristics of the excitation source information in the speech signal is proposed. The proposed excitation source feature yields vowel regions that are quite different from those obtained through mel-frequency cepstral coefficients. Exploiting the differences in the evidence obtained from the two kinds of features, a technique to combine the evidence is also proposed in order to obtain a better estimate of the VOPs and VEPs. When the proposed techniques are evaluated on the vowel–nonvowel classification systems developed using the TIMIT database, significant improvements are observed. Moreover, these improvements hold across all the acoustic modeling paradigms explored in this work.
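As an illustration of the three-class setup described above, the following is a minimal sketch (not the authors' code) of how TIMIT phone labels can be grouped into the vowel, nonvowel and silence classes, and how VOPs and VEPs can be read off a frame-level class sequence produced by forced alignment. The label sets and the 10 ms frame shift are assumptions made for the example.

```python
# Hypothetical sketch: map TIMIT phone labels to the three classes used in the
# paper (vowel, nonvowel, silence) and derive (VOP, VEP) pairs from a
# frame-level class sequence. Label sets and frame shift are assumptions.

VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ax-h", "axr", "ay",
          "eh", "er", "ey", "ih", "ix", "iy", "ow", "oy", "uh", "uw", "ux"}
SILENCE = {"sil", "pau", "epi", "h#"}

def to_class(phone):
    """Map a TIMIT phone label to one of the three classes."""
    if phone in VOWELS:
        return "vowel"
    if phone in SILENCE:
        return "silence"
    return "nonvowel"

def vowel_regions(frame_classes, frame_shift=0.010):
    """Return (VOP, VEP) pairs in seconds from a per-frame class sequence."""
    regions, start = [], None
    for i, c in enumerate(frame_classes):
        if c == "vowel" and start is None:
            start = i                      # vowel onset frame
        elif c != "vowel" and start is not None:
            regions.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:                  # vowel region running to the end
        regions.append((start * frame_shift, len(frame_classes) * frame_shift))
    return regions
```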
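The abstract does not spell out how the evidence obtained from the MFCC-based and excitation-source-based systems is combined. Purely as a hedged illustration, one simple strategy is to retain a VOP (or VEP) only when both systems place an endpoint within a small tolerance of each other. The `combine_endpoints` helper and the 40 ms tolerance below are hypothetical and are not claimed to be the combination scheme proposed in the paper.

```python
# Illustrative combination of VOP (or VEP) evidence from two feature streams;
# NOT the combination scheme of the paper. Times are in seconds.

def combine_endpoints(mfcc_points, excitation_points, tol=0.040):
    """Keep an MFCC-based point only if the excitation-source evidence places
    a point within +/- tol seconds; return the midpoint of the matched pair
    as the combined estimate."""
    combined = []
    for t in mfcc_points:
        matches = [u for u in excitation_points if abs(u - t) <= tol]
        if matches:
            nearest = min(matches, key=lambda u: abs(u - t))
            combined.append(0.5 * (t + nearest))
    return combined

# Example with made-up VOPs from the two systems:
# combine_endpoints([0.12, 0.55, 0.98], [0.10, 0.57, 1.40])
# -> approximately [0.11, 0.56]; the unsupported point at 0.98 s is discarded.
```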


Notes

  1. Note that, for all the configuration parameters discussed, the chosen values are taken from the Kaldi recipe.


Author information


Corresponding author

Correspondence to Gayadhar Pradhan.


Cite this article

Kumar, A., Shahnawazuddin, S. & Pradhan, G. Improvements in the Detection of Vowel Onset and Offset Points in a Speech Sequence. Circuits Syst Signal Process 36, 2315–2340 (2017). https://doi.org/10.1007/s00034-016-0409-1
