Abstract
In some languages, such as Finnish and Hungarian, phone duration is an important distinctive acoustic cue. The conventional HMM speech recognition framework, however, is known to model duration information poorly. In this paper we compare different duration models within the framework of HMM/ANN hybrids. The tests are performed with two hybrid models: the conventional one and the recently proposed “averaging hybrid”. Independent of the model configuration, we find that the usual exponential duration model has no detectable advantage over using no duration model at all. Similarly, applying the same fixed value to all state transition probabilities, as is usual with HMM/ANN systems, is found to have no influence on performance. However, the practical trick of imposing a minimum duration on the phones turns out to be very useful. The key contribution of the paper is the introduction of a gamma distribution duration model, which proves clearly superior to the exponential one, yielding a 12-20% relative improvement in the word error rate and thus justifying the use of sophisticated duration models in speech recognition.
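To illustrate the contrast drawn in the abstract, the following is a minimal sketch, not the authors' implementation. It compares the duration model implied by a fixed HMM self-loop probability (a geometric distribution, the discrete counterpart of the exponential) with a gamma density. The function names, the toy durations, and the method-of-moments fit are assumptions made here for illustration; the paper may estimate its parameters differently.

```python
import math


def geometric_log_prob(d, self_loop):
    """Log-probability of staying exactly d frames in a state whose
    self-loop probability is `self_loop` (implicit HMM duration model)."""
    if d < 1:
        return float("-inf")
    return (d - 1) * math.log(self_loop) + math.log(1.0 - self_loop)


def gamma_log_pdf(d, shape, scale):
    """Log-density of a gamma distribution evaluated at duration d."""
    if d <= 0:
        return float("-inf")
    return ((shape - 1.0) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))


def fit_gamma_moments(durations):
    """Method-of-moments gamma fit (an assumption of this sketch):
    shape = mean^2 / variance, scale = variance / mean."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((x - mean) ** 2 for x in durations) / n
    return mean * mean / var, var / mean


def most_likely_duration(log_prob, max_frames=20):
    """Integer duration in 1..max_frames with the highest log-score."""
    return max(range(1, max_frames + 1), key=log_prob)


if __name__ == "__main__":
    # Hypothetical phone durations in frames, invented for this example.
    durations = [4, 5, 6, 7, 8, 6, 5, 7]
    shape, scale = fit_gamma_moments(durations)

    # A self-loop probability that gives the same mean duration of 6 frames.
    mean = sum(durations) / len(durations)
    self_loop = 1.0 - 1.0 / mean

    print("geometric mode:",
          most_likely_duration(lambda d: geometric_log_prob(d, self_loop)))
    print("gamma mode:",
          most_likely_duration(lambda d: gamma_log_pdf(d, shape, scale)))
```

Whatever the self-loop probability, the geometric model always scores a one-frame duration highest, whereas a gamma density with shape greater than one peaks near the typical duration. This is one plausible reading of why the exponential model adds little and a gamma model can help.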
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tóth, L., Kocsor, A. (2005). Explicit Duration Modelling in HMM/ANN Hybrids. In: Matoušek, V., Mautner, P., Pavelka, T. (eds) Text, Speech and Dialogue. TSD 2005. Lecture Notes in Computer Science, vol 3658. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551874_40
DOI: https://doi.org/10.1007/11551874_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28789-6
Online ISBN: 978-3-540-31817-0
eBook Packages: Computer Science (R0)