Abstract
In automatic speech recognition (ASR) systems, hidden Markov models (HMMs) have been widely used to model the temporal structure of the speech signal. As discussed in Part I, the conventional acoustic models used for ASR have several drawbacks, such as weak duration modeling and poor discrimination. This paper (Part II) reviews the techniques proposed in the literature to refine standard HMM methods and address these limitations, and outlines current advances on the topic. The approaches emphasized in this part of the review are the connectionist approach, explicit duration modeling, discriminative training, and margin-based estimation methods. Further, various challenges and performance issues, such as environmental variability, tied mixture modeling, and handling of distant speech signals, are analyzed along with directions for future research.
Cite this article
Aggarwal, R.K., Dave, M. Acoustic modeling problem for automatic speech recognition system: advances and refinements (Part II). Int J Speech Technol 14, 309–320 (2011). https://doi.org/10.1007/s10772-011-9106-4
DOI: https://doi.org/10.1007/s10772-011-9106-4