Discovering Unordered and Ordered Phrase Association Patterns for Text Mining

Fujino, Ryoichi; Arimura, Hiroki; Arikawa, Setsuo

doi:10.1007/3-540-45571-X_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1805)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper considers the problem of finding all frequent phrase association patterns in a large collection of unstructured texts, where a phrase association pattern is a set of consecutive sequences of an arbitrary number of keywords that appear together in a document. For the ordered and the unordered versions of phrase association patterns, we present efficient algorithms, called Levelwise-Scan, based on the sequential counting technique of the Apriori algorithm. To cope with the huge feature space of phrase association patterns, the algorithms use the generalized suffix tree and the pattern matching automaton. By theoretical and empirical analyses, we show that the algorithms run quickly on most random texts for a wide range of parameter values and scale up for large disk-resident text databases.
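
The levelwise idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Levelwise-Scan: it replaces the generalized suffix tree and the pattern matching automaton with naive n-gram enumeration held in memory, and it shows only the Apriori-style candidate generation and counting over sets of phrases. The names `phrases_in`, `frequent_unordered_patterns`, and `occurs_in_order`, the parameters `max_len` and `max_phrases`, and the reading of "ordered" as left-to-right, non-overlapping occurrence are all assumptions made for this illustration.

```python
from itertools import combinations


def phrases_in(doc, max_len):
    """Return all phrases (word n-grams with n <= max_len) of one document."""
    words = doc.split()
    return {tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}


def frequent_unordered_patterns(docs, min_support, max_len=2, max_phrases=2):
    """Apriori-style levelwise search for sets of phrases that co-occur
    in at least `min_support` documents (naive, in-memory illustration)."""
    doc_phrases = [phrases_in(d, max_len) for d in docs]

    # Level 1: document frequency of every single phrase.
    counts = {}
    for ps in doc_phrases:
        for p in ps:
            key = frozenset([p])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(level)

    for k in range(2, max_phrases + 1):
        # Candidate generation: join frequent (k-1)-sets and keep a size-k
        # union only if all of its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(level), 2):
            c = a | b
            if len(c) == k and all(c - {p} in level for p in c):
                candidates.add(c)

        # Counting: one pass over the documents per level.
        counts = {c: 0 for c in candidates}
        for ps in doc_phrases:
            for c in candidates:
                if c <= ps:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        result.update(level)
        if not level:
            break
    return result


def occurs_in_order(words, phrases):
    """True if the phrases occur left to right in `words` without
    overlapping (greedy leftmost matching; an assumed definition)."""
    pos = 0
    for p in phrases:
        n = len(p)
        for i in range(pos, len(words) - n + 1):
            if tuple(words[i:i + n]) == tuple(p):
                pos = i + n
                break
        else:
            return False
    return True


docs = [
    "data mining finds frequent patterns",
    "text data mining uses suffix trees",
    "suffix trees index text",
]
patterns = frequent_unordered_patterns(docs, min_support=2)
```

In this toy collection the phrase set {"text", "suffix trees"} is frequent at support 2 because both phrases appear in the second and third documents, while {"data", "text"} is not. The sketch makes no attempt to drop redundant patterns in which one phrase is contained in another, and its cost grows quickly with `max_len`, which is the feature-space problem the abstract says the suffix tree and automaton are meant to address.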

Author information

Authors and Affiliations

Nippon Steel Information and Communication Systems, Inc., Kitakyushu, 804-0001, Japan
Ryoichi Fujino
PRESTO, Japan Science and Technology Corporation, Japan
Hiroki Arimura
Dept. Informatics, Kyushu Univ., Fukuoka, 812-8581, Japan
Hiroki Arimura & Setsuo Arikawa

Editor information

Editors and Affiliations

Graduate School of Systems Management, University of Tsukuba, 3-29-1 Otsuka, Bunkyo-ku, Tokyo, 112-0012, Japan
Takao Terano
Department of Computer Science and Engineering, Arizona State University, P.O. Box 875 406, Tempe, AZ, 85287-5406
Huan Liu
Department of Computer Science, National Tsing Hua University, Hsinchu, 300, Taiwan ROC
Arbee L. P. Chen

Cite this paper

Fujino, R., Arimura, H., Arikawa, S. (2000). Discovering Unordered and Ordered Phrase Association Patterns for Text Mining. In: Terano, T., Liu, H., Chen, A.L.P. (eds) Knowledge Discovery and Data Mining. Current Issues and New Applications. PAKDD 2000. Lecture Notes in Computer Science, vol 1805. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45571-X_34

DOI: https://doi.org/10.1007/3-540-45571-X_34
Published: 24 March 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67382-8
Online ISBN: 978-3-540-45571-4
eBook Packages: Springer Book Archive
