Abstract
This paper approaches epoch detection from a different perspective: it considers not only the estimation of epoch instants (i.e., glottal activity) but also the quantification of the absence of epochs (i.e., no glottal activity) in the unvoiced regions of the speech signal. Most epoch detection methods perform well in voiced regions but are not robust in unvoiced regions, where they detect a number of pseudo epochs. We propose a simple method based on the Teager Energy Operator (TEO) that determines the epochs in voiced regions (owing to its superior temporal resolution and its ability to capture airflow properties through the glottis) and is also very effective in unvoiced regions. Two recently proposed methods, the 0-Hz resonator-based method and the DYPSA method, gave a combined rate (CR) (for detecting epochs in voiced and unvoiced regions of speech) of 74.7% and 60%, respectively, and a pseudo epoch rate (PER) (i.e., spurious epochs in the unvoiced regions of speech) of 62.9% and 54.04%, respectively. In contrast, the proposed method gave a CR of 87% and a PER of 0.27%. These results suggest that the proposed method captures glottal activity more effectively in both voiced and unvoiced regions of the speech signal. Performance is demonstrated on the publicly available CMU-Arctic database, using the epoch information from the electro-glottograph (EGG) signal as ground truth for the estimation of glottal closure instants (GCI). Owing to the noise suppression capability of the TEO, the proposed method is affected little or not at all by signal degradations such as white, babble, high-frequency, and vehicle noise, i.e., it is more robust than the 0-Hz resonator and DYPSA methods.
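For orientation, the discrete-time TEO on which the method builds is commonly written, following Kaiser (1990), as Ψ[x(n)] = x²(n) − x(n−1)·x(n+1). The following is a minimal sketch of that operator alone, not of the epoch detection method proposed in this paper; the function name `teager_energy`, the zero-padding of the two end samples, and the 16 kHz test tone are illustrative assumptions.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The first and last samples have no two-sided neighbourhood, so they
    are left at zero here (an assumption made for this sketch).
    """
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

# Sanity check on a pure tone: for x[n] = A*cos(W*n), the operator
# returns the constant A^2 * sin(W)^2 at every interior sample.
fs = 16000.0            # assumed sampling rate in Hz
f0 = 200.0              # assumed tone frequency in Hz
amp = 0.5               # assumed amplitude
omega = 2.0 * np.pi * f0 / fs
n = np.arange(400)
tone = amp * np.cos(omega * n)

psi_tone = teager_energy(tone)
expected = amp ** 2 * np.sin(omega) ** 2

# An isolated impulse maps to a single peak at the same sample index,
# which illustrates the three-sample temporal resolution of the operator.
impulse = np.zeros(11)
impulse[5] = 1.0
psi_impulse = teager_energy(impulse)
```

Because the operator needs only three consecutive samples, its output reacts to abrupt changes in the signal almost instantaneously, which is the temporal-resolution property the abstract refers to.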
References
Ananthapadmanabha, T. V., & Yegnanarayana, B. (1979). Epoch extraction from linear prediction residual for identification of closed glottis interval. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(4), 309–319.
Atal, B. S., & Hanauer, S. L. (1971). Speech analysis and synthesis by linear prediction of the speech wave. The Journal of the Acoustical Society of America, 50(2), 637–655.
Bahoura, M., & Rouat, J. (2001). Wavelet speech enhancement based on the Teager energy operator. IEEE Signal Processing Letters, 8(1), 10–12.
Boudraa, A. O., Cexus, J. C., & Karim, A. M. (2008). Cross ΨB-energy operator based signal detection. The Journal of the Acoustical Society of America, 123(6), 4283–4289.
Brookes, M. (2006). Voicebox: A speech processing toolbox for MATLAB. [Online]. Available: http://www.ee.imperial.ac.uk/hp/staff/dmb/voicebox/voicebox.html.
Cairns, D. A., & Hansen, J. H. L. (1996). A noninvasive technique for detecting hypernasal speech using a nonlinear operator. IEEE Transactions on Signal Processing, 43(1), 35–44.
Cairns, D. A., Hansen, J. H. L., & Kaiser, J. F. (1996). Recent advances in hypernasal speech detection using the nonlinear Teager energy operator. In Proc. int. conf. spoken lang. process., ICSLP (Vol. 2, pp. 780–783).
“CMU-ARCTIC Speech Synthesis Databases.” [Online]. Available: http://festvox.org/cmu_arctic/index.html.
Dorman, M. F., Raphael, L. J., & Liberman, A. M. (1979). Some experiments on the sound of silence in phonetic perception. The Journal of the Acoustical Society of America, 65(6), 1518–1532.
Hamila, R., Lohan, S., & Renfors, M. (2003). Subchip multipath delay estimation for downlink WCDMA system based on Teager operator. IEEE Communications Letters, 7(1), 1–3.
Jabloun, F., Cetin, A. E., & Erzin, E. (1999). Teager energy based feature parameters for speech recognition in car noise. IEEE Signal Processing Letters, 6(10), 259–261.
Kaiser, J. F. (1990). On a simple algorithm to calculate the ‘energy’ of a signal. In Proc. IEEE int. conf. acoustics, speech, and signal processing, Albuquerque, NM (Vol. 1, pp. 381–384).
Kaushik, L., & Shaughnessy, D. (2009). A novel method for epoch extraction from speech signals. In Interspeech 2009, Brighton, UK (pp. 2883–2886).
Kominek, J., & Black, A. (2004). The CMU-Arctic speech databases. In 5th ISCA Speech Synthesis Workshop, Pittsburgh, PA (pp. 223–224).
Maragos, P., Kaiser, J. F., & Quatieri, T. F. (1991). Speech nonlinearities, modulations and energy operators. In Proc. int. conf. acoustics, speech, and signal processing, Toronto, Canada (pp. 421–424).
Markel, J. E., & Gray, A. H. (1982). Linear prediction of speech. New York: Springer.
Murty, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8), 1602–1613.
Murty, K. S. R., Yegnanarayana, B., & Joseph, M. A. (2009). Characterization of glottal activity from speech signals. IEEE Signal Processing Letters, 16(6), 469–472.
Naylor, P. A., Kounoudes, A., Gudnason, J., & Brookes, M. (2007). Estimation of glottal closure instants in voiced speech using the DYPSA algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 15(1), 34–43.
Ney, H. (1981). A dynamic programming technique for non-linear smoothing. In Proc. IEEE int. conf. acoust, speech, signal processing (pp. 62–65).
NOISEX-92 [Online]. Available: http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html.
Patil, H. A., & Parhi, K. (2010a). Novel variable length Teager energy based features for person recognition from their Hum. In IEEE international conference on acoustics speech and signal processing (ICASSP) (pp. 4526–4529).
Patil, H. A., & Parhi, K. (2010b). Development of TEO phase for speaker recognition. In Proc. of int. conf. on signal process. and Comm, SPCOM’10, Bangalore (pp. 1–5).
Quatieri, T. F. (2002). Discrete-time speech signal processing: Principles and practices. Upper Saddle River: Pearson Education.
Shikhah, N., & Deriche, M. (1999). A novel pitch estimation technique using the Teager energy function. In Int. symposium on signal process. and its applications, ISSPA, Brisbane, Australia (pp. 135–138).
Sinder, D. J. (1999). Speech synthesis using an aeroacoustic fricative model. Ph.D. Thesis, Rutgers University, New Brunswick, NJ.
Smits, R., & Yegnanarayana, B. (1995). Determination of instants of significant excitation in speech using group delay function. IEEE Transactions on Speech and Audio Processing, 3, 325–333.
Strube, H. W. (1974). Determination of the instant of glottal closures from the speech wave. The Journal of the Acoustical Society of America, 56, 1625–1629.
Teager, H. M., & Teager, S. M. (1990). Evidence for nonlinear sound production mechanisms in the vocal tract. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and speech modeling (pp. 241–261). Dordrecht: Kluwer Academic.
Veeneman, D., & BeMent, S. (1985). Automatic glottal inverse filtering from speech and electroglottographic signals. IEEE Transactions on Signal Processing, SP-33(4), 369–377.
Yegnanarayana, B., & Veldhuis, R. N. J. (1998). Extraction of vocal tract system characteristics from speech signals. IEEE Transactions on Speech and Audio Processing, 6(4), 313–327.
Zhou, G., Hansen, J. H. L., & Kaiser, J. F. (2001). Non linear feature based classification of speech under stress. IEEE Transactions on Speech and Audio Processing, 9(3), 201–216.
Cite this article
Patil, H.A., Viswanath, S. Effectiveness of Teager energy operator for epoch detection from speech signals. Int J Speech Technol 14, 321–337 (2011). https://doi.org/10.1007/s10772-011-9110-8
DOI: https://doi.org/10.1007/s10772-011-9110-8