Abstract
State-of-the-art automatic speech recognition (ASR) systems follow a well-established statistical paradigm: parameterization of the speech signal (a.k.a. feature extraction) at the front end, and likelihood evaluation of the resulting feature vectors at the back end. For feature extraction, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are the two dominant signal processing methods used in ASR. Although the effects of the two techniques have been analyzed individually, it is not known whether any combination of them can improve recognition accuracy. This paper investigates the possibility of integrating different types of features, such as MFCC, PLP, and gravity centroids, to improve the performance of ASR for the Hindi language. Our experimental results show a significant improvement for a few such combinations when applied to medium-sized lexicons under typical field conditions.
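The "gravity centroid" features mentioned above are spectral subband centroids: each is the magnitude-weighted mean frequency within one subband of a frame's power spectrum, and they can be concatenated with MFCC or PLP vectors to form a combined feature stream. The sketch below is an illustrative assumption of that computation (the band edges, toy spectrum, and 13-dimensional MFCC placeholder are hypothetical, not taken from the paper), using only NumPy:

```python
import numpy as np

def subband_centroids(power_spectrum, freqs, band_edges):
    """One spectral 'gravity' centroid per subband: the
    magnitude-weighted mean frequency inside that band."""
    centroids = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        weights = power_spectrum[mask]
        centroids.append(np.sum(freqs[mask] * weights) / np.sum(weights))
    return np.array(centroids)

# Toy frame: a single Gaussian spectral peak at 1000 Hz
freqs = np.linspace(0, 4000, 257)          # FFT bin centre frequencies (Hz)
spectrum = np.exp(-((freqs - 1000.0) ** 2) / (2 * 100.0 ** 2))
edges = [0, 2000, 4000]                    # two hypothetical subbands

cents = subband_centroids(spectrum, freqs, edges)
# The low band's centroid sits at the 1000 Hz peak.

# Sequential combination: append centroids to a (placeholder) MFCC vector.
mfcc_frame = np.zeros(13)                  # stands in for 13 real MFCCs
combined = np.concatenate([mfcc_frame, cents])
```

Concatenation along the feature axis, as in the last line, is the simplest form of the sequential feature-stream combination the paper evaluates; the resulting vectors would then be fed to the HMM back end as usual.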
Cite this article
Aggarwal, R.K., Dave, M. Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system. Telecommun Syst 52, 1457–1466 (2013). https://doi.org/10.1007/s11235-011-9623-0