Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study

Hockey, Beth Ann; Rayner, Manny; Christian, Gwen

doi:10.1007/978-3-540-85287-2_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5221)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

Statistical language models (SLMs) for speech recognition have the advantage of robustness, and grammar-based models (GLMs) the advantage that they can be built even when little corpus data is available. A known way to attempt to combine these two methodologies is first to create a GLM, and then use that GLM to generate training data for an SLM. It has however been difficult to evaluate the true utility of the idea, since the corpus data used to create the GLM has not in general been explicitly available. We exploit the Open Source Regulus platform, which supports corpus-based construction of linguistically motivated GLMs, to perform a methodologically sound comparison: the same data is used both to create an SLM directly, and also to create a GLM, which is then used to generate data to train an SLM. An evaluation on a medium-vocabulary task showed that the indirect method of constructing the SLM is in fact only marginally better than the direct one. The method used to create the training data is critical, with PCFG generation heavily outscoring CFG generation.
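
The indirect route described in the abstract (generate sentences from a grammar, then train an SLM on them) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy grammar, its rule probabilities, and the function names are hypothetical and are not taken from Regulus or from the paper's experiments. The `weighted` flag contrasts PCFG generation (rules sampled according to their probabilities) with plain CFG generation (rules sampled uniformly), and the SLM is reduced to an add-alpha smoothed bigram model.

```python
import random
from collections import Counter, defaultdict

# Toy PCFG (hypothetical): nonterminal -> list of (right-hand side, probability).
TOY_PCFG = {
    "S": [(("switch", "on", "NP"), 0.7), (("switch", "off", "NP"), 0.3)],
    "NP": [(("the", "N"), 0.9), (("the", "N", "and", "NP"), 0.1)],
    "N": [(("light",), 0.8), (("fan",), 0.2)],
}


def generate(symbol, grammar, rng, weighted=True):
    """Expand `symbol` top-down into a list of words.

    weighted=True samples rules by their probabilities (PCFG generation);
    weighted=False samples rules uniformly (plain CFG generation).
    """
    if symbol not in grammar:
        return [symbol]  # terminal word
    rules = grammar[symbol]
    options = [rhs for rhs, _ in rules]
    weights = [p for _, p in rules] if weighted else None
    rhs = rng.choices(options, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(child, grammar, rng, weighted))
    return words


def train_bigram_counts(sentences):
    """Collect bigram counts, padding each sentence with <s> and </s>."""
    counts = defaultdict(Counter)
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_prob(counts, prev, cur, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(cur | prev)."""
    total = sum(counts[prev].values())
    return (counts[prev][cur] + alpha) / (total + alpha * vocab_size)


if __name__ == "__main__":
    rng = random.Random(0)
    pcfg_corpus = [generate("S", TOY_PCFG, rng) for _ in range(1000)]
    cfg_corpus = [generate("S", TOY_PCFG, rng, weighted=False) for _ in range(1000)]
    vocab = {w for s in pcfg_corpus + cfg_corpus for w in s} | {"</s>"}
    for name, corpus in (("PCFG", pcfg_corpus), ("CFG", cfg_corpus)):
        counts = train_bigram_counts(corpus)
        print(name, "P(fan | the) =", bigram_prob(counts, "the", "fan", len(vocab)))
```

In this toy setting, the uniformly sampled corpus over-represents rare rules such as the recursive noun phrase, so the bigram estimates drift away from the distribution the rule probabilities encode. This only illustrates why the generation method can matter; it does not reproduce the paper's evaluation.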



Author information

Authors and Affiliations

NASA Ames Research Center, UCSC UARC, Mail Stop 19-26, Moffett Field, CA 94035
Beth Ann Hockey
University of Geneva, TIM/ISSCO, 40 bvd du Pont-d’Arve, CH-1211, Geneva 4, Switzerland
Manny Rayner
Dept of Linguistics, UC Santa Cruz, USA
Gwen Christian

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Chalmers University of Technology, 41296, Göteborg, Sweden
Bengt Nordström & Aarne Ranta

About this paper

Cite this paper

Hockey, B.A., Rayner, M., Christian, G. (2008). Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study. In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_19

DOI: https://doi.org/10.1007/978-3-540-85287-2_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85286-5
Online ISBN: 978-3-540-85287-2
eBook Packages: Computer Science, Computer Science (R0)
