Abstract
This paper presents a study of automatic speech recognition (ASR) systems that use context-dependent phonemes and graphemes as sub-word units, built on both the conventional HMM/GMM system and the tandem system. Experiments on three continuous speech recognition tasks show that systems using only context-dependent graphemes can yield performance competitive with a context-dependent phoneme-based ASR system on small to medium vocabulary tasks. In particular, we demonstrate the utility of tandem features, which use an MLP trained to estimate phoneme posterior probabilities, in improving the performance of grapheme-based recognition systems: they implicitly incorporate phonemic knowledge into the system without requiring a phonetically transcribed lexicon.
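To make the idea of grapheme sub-word units concrete, the following Python sketch shows one way a grapheme lexicon could be derived directly from word spellings and then expanded into context-dependent units. It is an illustration only, not the system studied in the paper: the helper names `grapheme_lexicon` and `to_context_dependent` are hypothetical, the `left-center+right` label format is an assumed HTK-style convention, and context is assumed to be word-internal.

```python
from typing import Dict, List


def grapheme_lexicon(words: List[str]) -> Dict[str, List[str]]:
    """Map each word to its letter sequence (hypothetical helper).

    Assumption: letters are lower-cased and non-alphabetic characters
    are dropped, so no phonetic transcription is needed.
    """
    lexicon: Dict[str, List[str]] = {}
    for word in words:
        letters = [ch for ch in word.lower() if ch.isalpha()]
        if letters:
            lexicon[word] = letters
    return lexicon


def to_context_dependent(units: List[str]) -> List[str]:
    """Expand a unit sequence into word-internal context-dependent labels.

    Assumption: labels follow a 'left-center+right' format; units at the
    word boundary omit the missing left or right context.
    """
    labels: List[str] = []
    for i, center in enumerate(units):
        label = center
        if i > 0:
            label = f"{units[i - 1]}-{label}"
        if i < len(units) - 1:
            label = f"{label}+{units[i + 1]}"
        labels.append(label)
    return labels


if __name__ == "__main__":
    lexicon = grapheme_lexicon(["nine", "zero"])
    for word, letters in lexicon.items():
        print(word, " ".join(to_context_dependent(letters)))
```

The same expansion function would apply unchanged to a phoneme sequence taken from a pronunciation dictionary, which is what makes the two kinds of sub-word unit directly comparable within one modelling framework.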
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dines, J., Magimai Doss, M. (2008). A Study of Phoneme and Grapheme Based Context-Dependent ASR Systems. In: Popescu-Belis, A., Renals, S., Bourlard, H. (eds) Machine Learning for Multimodal Interaction. MLMI 2007. Lecture Notes in Computer Science, vol 4892. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78155-4_19
DOI: https://doi.org/10.1007/978-3-540-78155-4_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78154-7
Online ISBN: 978-3-540-78155-4
eBook Packages: Computer Science, Computer Science (R0)