Acoustic modeling problem for automatic speech recognition system: advances and refinements (Part II)

Aggarwal, Rajesh Kumar; Dave, M.

doi:10.1007/s10772-011-9106-4

Published: 31 August 2011

Volume 14, pages 309–320 (2011)

International Journal of Speech Technology

Abstract

In automatic speech recognition (ASR) systems, hidden Markov models (HMMs) have been widely used for modeling the temporal speech signal. As discussed in Part I, the conventional acoustic models used for ASR have several drawbacks, such as weak duration modeling and poor discrimination. This paper (Part II) reviews the techniques proposed in the literature for refining standard HMM methods to cope with these limitations. Current advances related to this topic are also outlined. The approaches emphasized in this part of the review are the connectionist approach, explicit duration modeling, discriminative training, and margin-based estimation methods. Further, various challenges and performance issues, such as environmental variability, tied mixture modeling, and handling of distant speech signals, are analyzed along with directions for future research.

Author information

Authors and Affiliations

National Institute of Technology, Kurukshetra, Haryana, India
Rajesh Kumar Aggarwal & M. Dave

Corresponding author

Correspondence to Rajesh Kumar Aggarwal.

About this article

Cite this article

Aggarwal, R.K., Dave, M. Acoustic modeling problem for automatic speech recognition system: advances and refinements (Part II). Int J Speech Technol 14, 309–320 (2011). https://doi.org/10.1007/s10772-011-9106-4

Received: 03 August 2011
Accepted: 03 August 2011
Published: 31 August 2011
Issue Date: December 2011
DOI: https://doi.org/10.1007/s10772-011-9106-4
