Abstract
We describe an efficient implementation of a text mining algorithm for discovering a class of simple string patterns. The implementation uses an index structure for pattern discovery, called the virtual suffix tree, which is built on top of the suffix array. The resulting algorithm is simpler and faster in practice than the previous implementation based on the suffix tree.
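The abstract does not spell out how the virtual suffix tree is organized, so the following is only a minimal sketch of the general idea of simulating suffix tree nodes on top of a suffix array, not the authors' implementation. It assumes a naive sort-based suffix array construction, a linear-time longest-common-prefix (LCP) pass, and a stack-based sweep that reports each internal node as an interval of the suffix array. All function names (build_suffix_array, build_lcp_array, internal_nodes, repeated_substrings) are hypothetical and chosen for illustration.

```python
def build_suffix_array(text):
    # Naive construction (assumption for brevity): sort the start
    # positions by the suffixes they begin. Fine for small inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    # lcp[r] is the length of the longest common prefix of the suffixes
    # at ranks r-1 and r in the suffix array; lcp[0] is 0 by convention.
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        r = rank[i]
        if r > 0:
            j = sa[r - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[r] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h-1 characters
        else:
            h = 0
    return lcp


def internal_nodes(lcp):
    # Report (depth, left, right) for every non-root internal node of the
    # suffix tree without building the tree: each node corresponds to a
    # maximal interval sa[left..right] whose suffixes share `depth` chars.
    n = len(lcp)
    nodes = []
    stack = [(0, 0)]  # (string depth, left boundary); the root stays here
    for r in range(1, n + 1):
        cur = lcp[r] if r < n else 0
        left = r - 1
        while stack[-1][0] > cur:
            depth, left = stack.pop()
            nodes.append((depth, left, r - 1))
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return nodes


def repeated_substrings(text):
    # Map each branching repeated substring to its number of occurrences.
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    return {
        text[sa[left]:sa[left] + depth]: right - left + 1
        for depth, left, right in internal_nodes(lcp)
    }


if __name__ == "__main__":
    print(repeated_substrings("banana"))
```

Each reported interval plays the role of a suffix tree node: its width is the occurrence count of the node's string, which is the kind of frequency information a pattern discovery algorithm needs. The sketch keeps only two integer arrays besides the text, which is the usual motivation for preferring a suffix array over an explicit suffix tree.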
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Arimura, H., Asaka, H., Sakamoto, H., Arikawa, S. (2001). Efficient Discovery of Proximity Patterns with Suffix Arrays (Extended Abstract). In: Amir, A. (eds) Combinatorial Pattern Matching. CPM 2001. Lecture Notes in Computer Science, vol 2089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48194-X_14
DOI: https://doi.org/10.1007/3-540-48194-X_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42271-6
Online ISBN: 978-3-540-48194-2
eBook Packages: Springer Book Archive