
Integration of multiple acoustic and language models for improved Hindi speech recognition system

International Journal of Speech Technology

Abstract

Despite the significant progress of automatic speech recognition (ASR) over the past three decades, it has not yet reached the level of human performance, particularly under adverse conditions. To improve ASR performance, various approaches have been studied that differ in their feature extraction methods, classification methods, and training algorithms. Different approaches often exploit complementary information; therefore, combining them can be a better option. In this paper, we propose a novel approach that uses the best characteristics of conventional, hybrid, and segmental HMMs by integrating them with the help of the ROVER system combination technique. In the proposed framework, three different recognizers are created and combined, each having its own feature set and classification technique. For the design and development of the complete system, three separate acoustic models are used with three different feature sets and two language models. Experimental results show that the word error rate (WER) can be reduced by about 4% using the proposed technique compared to conventional methods. The various modules are implemented and tested for Hindi-language ASR in typical field conditions as well as in noisy environments.
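The abstract refers to the ROVER system combination technique for merging the outputs of the three recognizers. As a rough illustration only, and not the authors' implementation, the sketch below shows ROVER-style combination with frequency-of-occurrence voting: each hypothesis is aligned into a composite word transition network (WTN) by dynamic-programming alignment, and each slot of the network is then decided by majority vote. The function names and the toy transliterated Hindi sentences are ours; the full ROVER procedure also supports voting weighted by word confidence scores, which this sketch omits.

```python
# Illustrative sketch of ROVER-style recognizer output combination.
# Hypotheses from several recognizers are merged into a word transition
# network (WTN) by DP alignment, then each slot is decided by majority
# (frequency-of-occurrence) voting. Toy example only; names are ours.

from collections import Counter

NULL = "@"  # marks "no word" in a slot (insertion/deletion placeholder)


def align(wtn, hyp):
    """Align hypothesis `hyp` (a list of words) against the current WTN
    (a list of slots; each slot is a list of words) and return the new WTN."""
    width = len(wtn[0]) if wtn else 0          # hypotheses merged so far
    n, m = len(wtn), len(hyp)
    # cost[i][j]: minimum edit cost aligning the first i slots with the first j words
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if hyp[j - 1] in wtn[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # match / substitute
                             cost[i - 1][j] + 1,         # word missing in hyp
                             cost[i][j - 1] + 1)         # word inserted in hyp
    # Backtrace along an optimal path, extending every slot by one entry.
    slots, i, j = [], n, m
    while i > 0 or j > 0:
        sub = 0 if (i > 0 and j > 0 and hyp[j - 1] in wtn[i - 1]) else 1
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            slots.append(wtn[i - 1] + [hyp[j - 1]]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            slots.append(wtn[i - 1] + [NULL]); i -= 1
        else:
            slots.append([NULL] * width + [hyp[j - 1]]); j -= 1
    return list(reversed(slots))


def rover_combine(hypotheses):
    """Combine recognizer outputs (lists of words) by per-slot voting."""
    wtn = [[w] for w in hypotheses[0]]         # seed the WTN with hypothesis 1
    for hyp in hypotheses[1:]:
        wtn = align(wtn, hyp)
    combined = []
    for slot in wtn:
        word, _ = Counter(slot).most_common(1)[0]   # ties broken arbitrarily
        if word != NULL:                       # keep the slot only if a real word wins
            combined.append(word)
    return combined


if __name__ == "__main__":
    outputs = [
        "mera naam raam hai".split(),          # recognizer 1 (e.g. conventional HMM)
        "mera naam shyaam hai".split(),        # recognizer 2 (e.g. hybrid HMM/ANN)
        "mera naam raam".split(),              # recognizer 3 (e.g. segmental HMM)
    ]
    print(" ".join(rover_combine(outputs)))    # -> mera naam raam hai
```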



Author information

Corresponding author

Correspondence to R. K. Aggarwal.

About this article

Cite this article

Aggarwal, R.K., Dave, M. Integration of multiple acoustic and language models for improved Hindi speech recognition system. Int J Speech Technol 15, 165–180 (2012). https://doi.org/10.1007/s10772-012-9131-y
