Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system

Abstract

State-of-the-art automatic speech recognition (ASR) systems follow a well-established statistical paradigm: parameterization of the speech signal (feature extraction) at the front end and likelihood evaluation of the feature vectors at the back end. For feature extraction, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are the two dominant signal-processing methods used in ASR. Although the effects of the two techniques have been analyzed individually, it is not known whether combining them can improve recognition accuracy. This paper investigates the possibility of integrating different types of features, such as MFCC, PLP and gravity centroids, to improve the performance of ASR for the Hindi language. Our experimental results show a significant improvement for a few such combinations when applied to medium-sized lexicons in typical field conditions.
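
The abstract describes concatenating heterogeneous feature streams into a single front-end vector. The following is a minimal illustrative sketch, not the authors' implementation: it assumes the librosa library, a placeholder file name "speech.wav", and uses the per-frame spectral centroid as a rough stand-in for the subband gravity-centroid features; PLP is omitted because librosa does not provide it.

```python
# Illustrative sketch only: frame-wise concatenation of two heterogeneous
# feature streams (MFCC + spectral centroid). Not the paper's exact front
# end; PLP is omitted and "speech.wav" is a placeholder file name.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # 16 kHz mono speech

# 13 MFCCs per frame (25 ms window, 10 ms hop), a common ASR configuration
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Spectral centroid per frame, used here as a crude analogue of the
# gravity-centroid features mentioned in the abstract
centroid = librosa.feature.spectral_centroid(y=y, sr=sr,
                                             n_fft=400, hop_length=160)

# Stack the two streams along the feature axis: shape (13 + 1, n_frames)
combined = np.vstack([mfcc, centroid])
print(combined.shape)
```

Because both streams are computed with the same window and hop length, their frame counts match and the vectors can simply be stacked; the combined vectors would then be fed to the HMM back end in the usual way.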

Author information

Corresponding author: R. K. Aggarwal

Cite this article

Aggarwal, R.K., Dave, M. Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system. Telecommun Syst 52, 1457–1466 (2013). https://doi.org/10.1007/s11235-011-9623-0
