Abstract
Detecting the vowel regions in a given speech signal has long been a challenging research problem. A number of works have been reported over the years on accurately detecting vowel regions and the corresponding vowel onset points (VOPs) and vowel end points (VEPs), exploring the effectiveness of both statistical acoustic modeling techniques and front-end signal processing approaches. The work presented in this paper aims at improving the detection of vowel regions as well as of the VOPs and VEPs. Several statistical modeling approaches developed over the years are employed for this task. To this end, three-class classifiers (vowel, nonvowel and silence) are developed on the TIMIT database using the different acoustic modeling techniques, and their classification performances are studied. Using a given three-class classifier, a speech sample is then force-aligned against the trained acoustic model under the constraints of a first-pass transcription to detect the vowel regions. The correctly detected and spurious vowel regions are analyzed in detail to quantify the impact of semivowel and nasal sound units on the detection of vowel regions as well as on the determination of VOPs and VEPs. In addition, a novel front-end feature extraction technique exploiting the temporal and spectral characteristics of the excitation source information in the speech signal is proposed. The proposed excitation source feature results in detected vowel regions that differ considerably from those obtained through mel-frequency cepstral coefficients. Exploiting these differences, a technique for combining the two kinds of evidence is also proposed in order to obtain a better estimate of the VOPs and VEPs.
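The abstract does not spell out how the excitation source information is extracted. A standard starting point for excitation-source evidence in this line of work is the linear-prediction (LP) residual, obtained by inverse-filtering each frame with its own all-pole vocal-tract model. The following is a minimal sketch of that idea; the function names, frame size and model order are illustrative assumptions, not the paper's actual feature pipeline.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_autocorr(frame, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin).

    Returns the inverse-filter polynomial a = [1, a_1, ..., a_order] so that
    the prediction error is e[n] = x[n] + sum_j a_j * x[n - j].
    """
    n = len(frame)
    # Autocorrelation at lags 0..order (full correlation is symmetric about n-1).
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    if err <= 0:  # silent frame: fall back to the identity filter
        return a
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
        if err <= 0:  # model already (numerically) exact
            break
    return a

def lp_residual(signal, order=10, frame_len=320):
    """Frame-wise LP residual: estimate an all-pole model per frame and
    inverse-filter the frame, leaving (approximately) the excitation source."""
    residual = np.zeros(len(signal))
    window = np.hamming(frame_len)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        a = lpc_autocorr(frame * window, order)  # estimate on the windowed frame
        residual[start:start + frame_len] = lfilter(a, [1.0], frame)
    return residual
```

For strongly voiced (vowel-like) regions the residual energy is concentrated around glottal closure instants, which is what makes it useful evidence for VOP/VEP detection.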
When the proposed techniques are evaluated on vowel–nonvowel classification systems developed using the TIMIT database, significant improvements are noted. Moreover, these improvements hold across all the acoustic modeling paradigms explored in this work.
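The combination of the MFCC-based and excitation-source-based evidences could take many forms; the abstract does not give the paper's actual scheme. As a hedged illustration only, one simple approach is to pair candidate VOP instants from the two streams that agree within a small tolerance and average them, keeping unmatched candidates from the primary stream:

```python
def combine_vop_evidence(vops_a, vops_b, tol=0.04):
    """Merge two lists of candidate VOP instants (in seconds).

    Candidates from the two evidence streams that agree within `tol` seconds
    are averaged; candidates seen only in stream A are kept as-is. This greedy
    pairing is an illustrative assumption, not the paper's combination method.
    """
    combined = []
    used_b = set()
    for a in sorted(vops_a):
        # Nearest not-yet-paired candidate in stream B.
        best = min(
            (j for j in range(len(vops_b)) if j not in used_b),
            key=lambda j: abs(vops_b[j] - a),
            default=None,
        )
        if best is not None and abs(vops_b[best] - a) <= tol:
            used_b.add(best)
            combined.append((a + vops_b[best]) / 2.0)
        else:
            combined.append(a)
    return sorted(combined)
```

A symmetric variant could also retain unmatched candidates from stream B, or weight each stream by its estimated reliability; the same pairing logic applies to VEPs.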
Notes
Note that, for all the discussed configuration parameters, the chosen values are taken from the corresponding Kaldi recipe.
Cite this article
Kumar, A., Shahnawazuddin, S. & Pradhan, G. Improvements in the Detection of Vowel Onset and Offset Points in a Speech Sequence. Circuits Syst Signal Process 36, 2315–2340 (2017). https://doi.org/10.1007/s00034-016-0409-1