Acoustic modeling problem for automatic speech recognition system: advances and refinements (Part II)

International Journal of Speech Technology

Abstract

In automatic speech recognition (ASR) systems, hidden Markov models (HMMs) have been widely used to model the temporal speech signal. As discussed in Part I, the conventional acoustic models used for ASR have several drawbacks, such as weak duration modeling and poor discrimination. This paper (Part II) reviews the techniques proposed in the literature to refine standard HMM methods and overcome these limitations, and it outlines current advances on the topic. The approaches emphasized in this part of the review are the connectionist approach, explicit duration modeling, discriminative training, and margin-based estimation methods. Further, various challenges and performance issues, such as environmental variability, tied mixture modeling, and the handling of distant speech signals, are analyzed, along with directions for future research.

Author information

Correspondence to Rajesh Kumar Aggarwal.

About this article

Cite this article

Aggarwal, R.K., Dave, M. Acoustic modeling problem for automatic speech recognition system: advances and refinements (Part II). Int J Speech Technol 14, 309–320 (2011). https://doi.org/10.1007/s10772-011-9106-4

  • DOI: https://doi.org/10.1007/s10772-011-9106-4