research-article

Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing

Authors:
Chuancong Gao, Saarland University, Saarbrücken, Germany
Sebastian Michel, Saarland University, Saarbrücken, Germany

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology, March 2012, Pages 264–275, https://doi.org/10.1145/2247596.2247628

Published: 27 March 2012

ABSTRACT

In this paper we consider the problem of mining frequently occurring interesting phrases in large document collections in an ad-hoc fashion. Ad-hoc refers to the ability to perform such analyses over text corpora that can be an arbitrary subset of a global set of documents. Most of the time, these ad-hoc document collections are identified by a user- or application-defined query whose aim is to gather statistics describing the sub-collection, as a starting point for further data analysis tasks. Our approach to mining the top-k most interesting phrases consists of a novel indexing technique, called Sequence Pattern Indexing (SeqPattIndex), that benefits from the observation that phrases often overlap sequentially. We devise a forest-based index for phrases and a further improved version that eliminates additional redundancy. The top-k phrase mining algorithm that operates on these indices combines a simple merge join with ideas from the pattern-growth framework of the data mining community, and it uses early termination and search-space pruning techniques to improve runtime performance. On average, our approach consumes less index space and has a lower runtime for the top-k phrase mining task, as we demonstrate in an experimental evaluation on real-world data.
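
The abstract does not spell out the index layout or the scoring function, so the following is only a rough sketch of the general idea, not the authors' SeqPattIndex. It assumes a plain prefix forest in which phrases sharing a prefix share a path, sorted document-id lists at each node, a merge join against the ad-hoc subset, and local document frequency as a stand-in for interestingness. All names (`PhraseNode`, `build_forest`, `merge_count`, `top_k_phrases`) and parameters (`max_len`, `min_len`) are hypothetical.

```python
import heapq
from typing import Dict, Iterable, List, Sequence, Tuple


class PhraseNode:
    """A node of the phrase forest: the phrase spelled by the path to it."""

    def __init__(self) -> None:
        self.children: Dict[str, "PhraseNode"] = {}
        self.docs: List[int] = []  # sorted, duplicate-free document ids


def build_forest(corpus: Dict[int, Sequence[str]], max_len: int = 3) -> PhraseNode:
    """Index every word n-gram with n <= max_len (assumed phrase definition)."""
    root = PhraseNode()
    for doc_id in sorted(corpus):  # ascending ids keep each list sorted
        words = corpus[doc_id]
        for start in range(len(words)):
            node = root
            for word in words[start:start + max_len]:
                node = node.children.setdefault(word, PhraseNode())
                if not node.docs or node.docs[-1] != doc_id:
                    node.docs.append(doc_id)
    return root


def merge_count(postings: List[int], subset: List[int]) -> int:
    """Merge join of two sorted id lists; returns the intersection size."""
    i = j = count = 0
    while i < len(postings) and j < len(subset):
        if postings[i] == subset[j]:
            count += 1
            i += 1
            j += 1
        elif postings[i] < subset[j]:
            i += 1
        else:
            j += 1
    return count


def top_k_phrases(root: PhraseNode, subset_ids: Iterable[int], k: int,
                  min_len: int = 2) -> List[Tuple[str, int]]:
    """Top-k phrases by document frequency within the ad-hoc subset."""
    subset = sorted(set(subset_ids))
    heap: List[Tuple[int, Tuple[str, ...]]] = []  # min-heap of (count, phrase)

    def grow(node: PhraseNode, prefix: Tuple[str, ...]) -> None:
        for word, child in node.children.items():
            count = merge_count(child.docs, subset)
            # An extension never occurs in more documents than its prefix,
            # so skip the subtree if it cannot beat the current k-th best.
            if count == 0 or (len(heap) == k and count <= heap[0][0]):
                continue
            phrase = prefix + (word,)
            if len(phrase) >= min_len:
                if len(heap) < k:
                    heapq.heappush(heap, (count, phrase))
                else:
                    heapq.heapreplace(heap, (count, phrase))
            grow(child, phrase)  # pattern-growth style extension

    grow(root, ())
    ranked = sorted(heap, key=lambda entry: (-entry[0], entry[1]))
    return [(" ".join(phrase), count) for count, phrase in ranked]


corpus = {
    1: "top k phrase mining".split(),
    2: "interesting phrase mining in text".split(),
    3: "phrase mining with top k pruning".split(),
    4: "graph mining".split(),
}
forest = build_forest(corpus)
result = top_k_phrases(forest, subset_ids=[1, 2, 3], k=2)
print(result)
```

The subset `[1, 2, 3]` stands in for the result of a user query. Ranking by raw local frequency keeps the pruning bound simple; a ratio-based interestingness measure would need a different bound, which this sketch does not attempt.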

Index Terms

1. Information systems
  1. Information systems applications

Published in
EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
March 2012
643 pages
ISBN: 9781450307901
DOI: 10.1145/2247596
Editors:
Elke Rundensteiner, Worcester Polytechnic Institute
Volker Markl, Technische Universität Berlin, Germany
Ioana Manolescu, INRIA, France
Sihem Amer-Yahia, QCRI, Doha, Qatar
Felix Naumann, Hasso Plattner Institute, Potsdam, Germany
Ismail Ari, Ozyegin University, Turkey
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 7 of 10 submissions, 70%
Article Metrics
- Total Citations: 7
- Total Downloads: 152
- Downloads (Last 12 months): 0
- Downloads (Last 6 weeks): 0