Abstract
In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited size n. This approach, which is used with synchronous back-off, allows us to distinguish between alternative sequences using large contexts. We also show that we can build this kind of model with additional information for each symbol, such as part-of-speech tags and dependency information.
The approach can also be viewed as a collection of virtual k-testable automata. Once built, we can directly access the results of any k-testable automaton generated from the input training data. Synchronous back-off automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.
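The idea can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain suffix array built by naive sorting rather than an enhanced suffix array, and it assumes that synchronous back-off means shortening the context of all alternatives in lockstep until at least one alternative has been seen in the training data. The names `build_suffix_array`, `count`, and `synchronous_backoff` are hypothetical.

```python
def build_suffix_array(tokens):
    """Return suffix start positions, sorted by the suffix they begin.

    Naive construction for illustration only; real systems use faster
    algorithms and extra tables (an enhanced suffix array).
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count(tokens, sa, ngram):
    """Count occurrences of ngram (any length) with two binary searches."""
    ngram = list(ngram)
    n = len(ngram)

    def bound(strict):
        # First index whose n-token prefix is >= ngram (strict=False)
        # or > ngram (strict=True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = tokens[sa[mid]:sa[mid] + n]
            if prefix < ngram or (strict and prefix == ngram):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


def synchronous_backoff(tokens, sa, left, alternatives):
    """Pick among alternatives using the longest context seen in training.

    Assumption: all alternatives are compared at the same n; the context is
    shortened for all of them together until some alternative has a count.
    Returns (best alternative, n used, counts at that n).
    """
    for ctx_len in range(len(left), -1, -1):
        context = left[len(left) - ctx_len:]
        counts = {alt: count(tokens, sa, context + [alt]) for alt in alternatives}
        if any(counts.values()):
            return max(counts, key=counts.get), ctx_len + 1, counts
    return None, 0, {}


# Toy training data (made up for this example).
tokens = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(tokens)
choice = synchronous_backoff(tokens, sa, ["under", "the"], ["cat", "mat"])
print(choice)
```

Because the suffix array indexes every suffix of the training data, the same structure answers count queries for any n. In this sketch the value of n returned by `synchronous_backoff` corresponds to the largest k for which the training data can still distinguish the alternatives.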
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stehouwer, H., van Zaanen, M. (2010). Enhanced Suffix Arrays as Language Models: Virtual k-Testable Languages. In: Sempere, J.M., García, P. (eds) Grammatical Inference: Theoretical Results and Applications. ICGI 2010. Lecture Notes in Computer Science, vol 6339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15488-1_32
DOI: https://doi.org/10.1007/978-3-642-15488-1_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15487-4
Online ISBN: 978-3-642-15488-1
eBook Packages: Computer Science (R0)