Abstract
The structural limitations of N-Gram models used for Language Modelling are illustrated through several examples. In most cases of interest, these limitations can be easily overcome using (general) regular or finite-state models, without resorting to more complex, recursive devices. The problem is how to obtain the required finite-state structures from a reasonably small number of (positive) training sentences of the considered task. Here this problem is approached through a Grammatical Inference technique known as MGGI. This technique makes it easy to apply a priori knowledge about the type of syntactic constraints relevant to the considered task, significantly improving on the performance of N-Grams while using similar or smaller amounts of training data. Speech Recognition experiments are presented, with results that support the interest of the proposed approach.
Work partially supported by the Spanish CICYT under grant TIC95-0984-C02-01
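The kind of structural limitation mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: the two training sentences, the function names, and in particular the renaming rule (tagging every word with the first word of its sentence) are assumptions chosen for illustration. The sketch infers a bigram-style local language from positive sentences, shows that it over-generalizes, and then shows how inferring the same kind of model over renamed words and projecting the labels away removes the spurious sentence, which is the general idea usually associated with MGGI.

```python
def infer_local(sample):
    """Infer a local (bigram-like) language from positive sentences:
    the allowed first symbols, last symbols, and adjacent pairs."""
    starts = {s[0] for s in sample}
    ends = {s[-1] for s in sample}
    pairs = {(u, v) for s in sample for u, v in zip(s, s[1:])}
    return starts, ends, pairs


def accepts_local(model, sentence):
    """Membership test for the plain local language (non-empty sentences)."""
    starts, ends, pairs = model
    return (sentence[0] in starts
            and sentence[-1] in ends
            and all(p in pairs for p in zip(sentence, sentence[1:])))


def relabel(sentence):
    """Hypothetical renaming: tag each word with the sentence's first word.
    This encodes the prior knowledge that the opening word constrains the rest."""
    return [(word, sentence[0]) for word in sentence]


def accepts_relabelled(model, sentence):
    """Membership in the projection of the labelled local language:
    track every labelled symbol that is reachable and projects onto each word."""
    starts, ends, pairs = model
    current = {s for s in starts if s[0] == sentence[0]}
    for word in sentence[1:]:
        current = {v for (u, v) in pairs if u in current and v[0] == word}
    return any(s in ends for s in current)


# Toy training sample (an assumption, not data from the paper).
sample = [
    ["either", "x", "or", "y"],
    ["neither", "x", "nor", "y"],
]
plain = infer_local(sample)
labelled = infer_local([relabel(s) for s in sample])

bad = ["either", "x", "nor", "y"]
print(accepts_local(plain, bad))               # True: the bigram model over-generalizes
print(accepts_relabelled(labelled, bad))       # False: the labels carry the long-range constraint
print(accepts_relabelled(labelled, sample[0])) # True: training sentences are still accepted
```

The labelled model is still a bigram-style model, only over a larger alphabet, so it can be trained from the same positive sentences; the choice of renaming is where task knowledge enters.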
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vidal, E., Llorens, D. (1996). Using knowledge to improve N-Gram language modelling through the MGGI methodology. In: Miclet, L., de la Higuera, C. (eds) Grammatical Inference: Learning Syntax from Sentences. ICGI 1996. Lecture Notes in Computer Science, vol 1147. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0033353
DOI: https://doi.org/10.1007/BFb0033353
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61778-5
Online ISBN: 978-3-540-70678-6