
A hybrid language model based on a combination of N-grams and stochastic context-free grammars

Published: 01 June 2004

Abstract

In this paper, a hybrid language model is defined as a combination of a word-based n-gram, which captures the local relations between words, and a category-based stochastic context-free grammar (SCFG) with a word distribution into categories, which represents the long-term relations between these categories. The problem of unsupervised learning of an SCFG in General Format and in Chomsky Normal Form by means of estimation algorithms is studied. Moreover, a bracketed version of the classical estimation algorithm based on the Earley algorithm is proposed. This paper also explores the use of SCFGs obtained from a treebank corpus as initial models for the estimation algorithms. Experiments on the UPenn Treebank corpus are reported; they are evaluated in terms of test set perplexity and of word error rate in a speech recognition experiment.
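The abstract does not say how the two components are combined or how perplexity is computed. The following is a minimal sketch, assuming a linear interpolation of the two per-word probabilities and the standard definition of perplexity. The function names, the weight `alpha`, and the probability values are hypothetical and are not taken from the paper.

```python
import math


def interpolate(p_ngram, p_scfg, alpha):
    """Linearly interpolate two conditional word probabilities.

    Assumption: the combination is a weighted sum with weight alpha on the
    n-gram component; the abstract itself does not state the combination rule.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_ngram + (1.0 - alpha) * p_scfg


def perplexity(word_probs):
    """Perplexity: exp of the average negative log-probability per word."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Hypothetical per-word probabilities assigned by each component model
# to the same three-word test sequence.
ngram_probs = [0.20, 0.05, 0.10]
scfg_probs = [0.10, 0.15, 0.10]

hybrid_probs = [
    interpolate(pn, ps, alpha=0.7) for pn, ps in zip(ngram_probs, scfg_probs)
]
print(perplexity(hybrid_probs))
```

In a setup like this, `alpha` would normally be tuned on held-out data. Lower perplexity on the test set means the combined model assigns higher probability to the observed words.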

      Published In

      ACM Transactions on Asian Language Information Processing (TALIP), Volume 3, Issue 2
      June 2004
      82 pages
      ISSN: 1530-0226
      EISSN: 1558-3430
      DOI: 10.1145/1034780

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Language model
      2. stochastic context-free grammar

      Cited By

      • (2023) Decoding Silent Speech Based on High-Density Surface Electromyogram Using Spatiotemporal Neural Network. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31, 2069-2078. DOI: 10.1109/TNSRE.2023.3266299. Online publication date: 2023
      • (2016) Combination of language models for word prediction. IEEE/ACM Transactions on Audio, Speech and Language Processing 24:9, 1477-1490. DOI: 10.1109/TASLP.2016.2547743. Online publication date: 1-Sep-2016
      • (2013) Opinion Mining of Movie Review using Hybrid Method of Support Vector Machine and Particle Swarm Optimization. Procedia Engineering 53, 453-462. DOI: 10.1016/j.proeng.2013.02.059. Online publication date: 2013
      • (2008) Improving phoneme and accent estimation by leveraging a dictionary for a stochastic TTS front-end. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 4689-4692. DOI: 10.1109/ICASSP.2008.4518703. Online publication date: Mar-2008
      • (2007) Extracting Grammars from RNA Sequences. Proceedings of the 8th international conference on Adaptive and Natural Computing Algorithms, Part I, 404-413. DOI: 10.1007/978-3-540-71618-1_45. Online publication date: 11-Apr-2007
      • (2005) Corpus based learning of stochastic, context-free grammars combined with Hidden Markov Models for tRNA modelling. International Journal of Bioinformatics Research and Applications 1:3, 305-318. DOI: 10.1504/IJBRA.2005.007908. Online publication date: 1-Sep-2005
      • (2005) Statistical and linguistic clustering for language modeling in ASR. Proceedings of the 10th Iberoamerican Congress conference on Progress in Pattern Recognition, Image Analysis and Applications, 556-565. DOI: 10.1007/11578079_58. Online publication date: 15-Nov-2005
      • (2005) Performance of a SCFG-based language model with training data sets of increasing size. Proceedings of the Second Iberian conference on Pattern Recognition and Image Analysis - Volume Part II, 586-594. DOI: 10.1007/11492542_72. Online publication date: 7-Jun-2005
