Abstract
Hidden Markov models (HMMs) with Gaussian mixture distributions rely on the assumption that speech features are temporally uncorrelated, and often assume a diagonal covariance matrix, so that correlations between feature vectors for adjacent frames are ignored. A Linear Dynamic Model (LDM) is a Markovian state-space model that also relies on hidden state modeling, but explicitly models the evolution of these hidden states using an autoregressive process. An LDM is capable of modeling higher-order statistics and can exploit correlations between features in an efficient and parsimonious manner. In this paper, we present a hybrid LDM/HMM decoder architecture that postprocesses segmentations derived from the first pass of an HMM-based recognizer. This smoothed trajectory model is complementary to existing HMM systems. An Expectation-Maximization (EM) approach for parameter estimation is presented. We demonstrate a 13% relative word error rate (WER) reduction on the Aurora-4 clean evaluation set, and a 13% relative WER reduction on the babble noise condition.
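To make the autoregressive state evolution concrete, the following is a minimal sketch of a scalar LDM in Python. It is an illustration under stated assumptions, not the decoder or EM procedure of the paper: the scalar form, the function names `simulate_ldm` and `kalman_loglik`, and all parameter values are hypothetical. The hidden state follows x[t+1] = F*x[t] + w and each frame is observed as y[t] = H*x[t] + v, with Gaussian noise terms w and v. A standard Kalman filter then gives the log-likelihood of a feature sequence, which is the kind of score a segment-level rescoring pass could use.

```python
import math
import random


def simulate_ldm(F, H, q, r, x0, n, seed=0):
    """Draw n observations from a scalar LDM (illustrative assumption):
    x[t+1] = F*x[t] + w,  w ~ N(0, q)
    y[t]   = H*x[t] + v,  v ~ N(0, r)
    """
    rng = random.Random(seed)
    x = x0
    ys = []
    for _ in range(n):
        ys.append(H * x + rng.gauss(0.0, math.sqrt(r)))
        x = F * x + rng.gauss(0.0, math.sqrt(q))
    return ys


def kalman_loglik(ys, F, H, q, r, m0, p0):
    """Log-likelihood of ys under the scalar LDM, via the Kalman filter."""
    m, p = m0, p0  # predicted state mean and variance for the current frame
    ll = 0.0
    for y in ys:
        e = y - H * m          # innovation (prediction error)
        s = H * p * H + r      # innovation variance
        ll += -0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
        k = p * H / s          # Kalman gain
        m = m + k * e          # filtered mean
        p = (1.0 - k * H) * p  # filtered variance
        m = F * m              # predict the next frame's state mean
        p = F * p * F + q      # predict the next frame's state variance
    return ll


# Hypothetical parameters: frames generated with strong temporal correlation.
frames = simulate_ldm(F=0.9, H=1.0, q=0.1, r=0.1, x0=0.0, n=200)
ll_ar = kalman_loglik(frames, F=0.9, H=1.0, q=0.1, r=0.1, m0=0.0, p0=1.0)
ll_iid = kalman_loglik(frames, F=0.0, H=1.0, q=0.1, r=0.1, m0=0.0, p0=1.0)
print(ll_ar, ll_iid)
```

Setting F to zero removes the dependence of each state on the previous one, so comparing the two scores shows how much of the sequence likelihood comes from modeling frame-to-frame correlation. A real system would use vector-valued states and observations with matrix-valued parameters estimated from data.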
Additional information
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0414450. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
About this article
Cite this article
Ma, T., Srinivasan, S., Lazarou, G. et al. Continuous speech recognition using linear dynamic models. Int J Speech Technol 17, 11–16 (2014). https://doi.org/10.1007/s10772-013-9200-x
DOI: https://doi.org/10.1007/s10772-013-9200-x