Abstract
In this paper, we study the computational cost of extracting character n-grams from a corpus. We propose an approach for reducing this cost that is especially relevant to text mining and natural language applications. The underlying idea is to consider only those n-grams that occur above a given frequency in the corpus. We apply this approach to three different corpora and extract all frequent n-grams from them in reasonable time.
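The frequency-threshold idea can be illustrated with a minimal sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes a level-wise, Apriori-style pruning rule: an n-gram can reach the threshold only if both of its (n-1)-character substrings already did. The function name `frequent_char_ngrams` and the parameters `min_count` and `max_n` are hypothetical, chosen here for illustration.

```python
from collections import Counter


def frequent_char_ngrams(text, min_count, max_n):
    """Return {n: {ngram: count}} for character n-grams with count >= min_count.

    Illustrative sketch only: level-wise counting with Apriori-style pruning,
    assumed here and not taken from the paper.
    """
    result = {}
    prev = None  # frequent (n-1)-grams found at the previous level
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Skip candidates whose prefix or suffix of length n-1 was not
            # frequent: such an n-gram cannot reach the threshold either.
            if prev is not None and (gram[:-1] not in prev or gram[1:] not in prev):
                continue
            counts[gram] += 1
        frequent = {g: c for g, c in counts.items() if c >= min_count}
        if not frequent:
            break  # no longer n-gram can be frequent, so stop early
        result[n] = frequent
        prev = frequent
    return result
```

For example, `frequent_char_ngrams("abracadabra", 2, 4)` keeps only `a`, `b` and `r` at length 1, so candidates containing `c` or `d` are never counted at higher lengths, and the only frequent 4-gram is `abra` with count 2. The pruning does not change the result; it only reduces the number of candidates that must be stored and counted at each length.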
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Marques, N.C., Braud, A. (2003). Mining Generalized Character n-Grams in Large Corpora. In: Pires, F.M., Abreu, S. (eds) Progress in Artificial Intelligence. EPIA 2003. Lecture Notes in Computer Science(), vol 2902. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24580-3_48
DOI: https://doi.org/10.1007/978-3-540-24580-3_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20589-0
Online ISBN: 978-3-540-24580-3
eBook Packages: Springer Book Archive