Abstract
This paper considers the problem of finding all frequent phrase association patterns in a large collection of unstructured texts, where a phrase association pattern is a set of consecutive sequences of an arbitrary number of keywords that appear together in a document. For the ordered and the unordered versions of phrase association patterns, we present efficient algorithms, called Levelwise-Scan, based on the sequential counting technique of the Apriori algorithm. To cope with the huge feature space of phrase association patterns, the algorithms use the generalized suffix tree and the pattern matching automaton. By theoretical and empirical analyses, we show that the algorithms run quickly on most random texts for a wide range of parameter values and scale up to large disk-resident text databases.
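The levelwise counting idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's Levelwise-Scan: it covers only the unordered version, enumerates phrases by brute force in memory rather than using a generalized suffix tree or a pattern matching automaton, and all names (`phrases_in`, `frequent_phrase_sets`, `min_count`, `max_len`, `max_phrases`) are hypothetical choices made for illustration.

```python
from itertools import combinations


def phrases_in(doc, max_len):
    """Return every contiguous word sequence of length 1..max_len in doc."""
    words = doc.split()
    return {tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}


def frequent_phrase_sets(docs, min_count, max_len=2, max_phrases=2):
    """Apriori-style levelwise search over unordered sets of phrases.

    Level d holds the sets of d phrases that co-occur in at least
    min_count documents. Assumption: brute-force phrase enumeration
    stands in for the suffix tree and automaton named in the abstract.
    """
    doc_phrases = [phrases_in(d, max_len) for d in docs]

    def support(pattern):
        # Number of documents containing every phrase of the pattern.
        return sum(1 for ps in doc_phrases if pattern <= ps)

    # Level 1: single phrases that are frequent on their own.
    singles = set().union(*doc_phrases)
    level = {frozenset([p]): support(frozenset([p])) for p in singles}
    level = {pat: n for pat, n in level.items() if n >= min_count}
    result = dict(level)

    d = 1
    while level and d < max_phrases:
        d += 1
        # Join step: merge two frequent (d-1)-sets into a d-set.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == d}
        # Prune step: every (d-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(c - {p} in level for p in c)}
        # Count step: one pass over the documents per level.
        level = {c: support(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_count}
        result.update(level)
    return result


docs = [
    "data mining finds patterns in text",
    "text data mining is useful",
    "suffix tree for text",
]
found = frequent_phrase_sets(docs, min_count=2)
```

Here `found` maps each frequent pattern (a `frozenset` of word tuples) to the number of documents that contain all of its phrases. The prune step is what keeps the candidate space manageable: a set of phrases is counted only if all of its smaller subsets were already frequent.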
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fujino, R., Arimura, H., Arikawa, S. (2000). Discovering Unordered and Ordered Phrase Association Patterns for Text Mining. In: Terano, T., Liu, H., Chen, A.L.P. (eds) Knowledge Discovery and Data Mining. Current Issues and New Applications. PAKDD 2000. Lecture Notes in Computer Science, vol 1805. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45571-X_34
DOI: https://doi.org/10.1007/3-540-45571-X_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67382-8
Online ISBN: 978-3-540-45571-4
eBook Packages: Springer Book Archive