ABSTRACT
In this paper we consider the problem of mining frequently occurring interesting phrases in large document collections in an ad-hoc fashion. Ad-hoc refers to the ability to perform such analyses over text corpora that can be an arbitrary subset of a global set of documents. Most of the time, the identification of these ad-hoc document collections is driven by a user- or application-defined query, with the aim of gathering statistics that describe the sub-collection as a starting point for further data analysis tasks. Our approach to mining the top-k most interesting phrases consists of a novel indexing technique, called Sequence Pattern Indexing (SeqPattIndex), that benefits from the observation that phrases often overlap sequentially. We devise a forest-based index for phrases and a further improved version with additional redundancy elimination power. The actual top-k phrase mining algorithm operating on these indices combines a simple merge join with ideas from the pattern-growth framework of the data mining community, and uses early termination and search space pruning techniques to improve runtime performance. Overall, our approach has on average a lower index space consumption as well as a lower runtime for the top-k phrase mining task, as we demonstrate in the experimental evaluation using real-world data.
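To make the task concrete, the following is a minimal baseline sketch in Python, not the SeqPattIndex structure or the algorithm of this paper. It assumes a hypothetical inverted index `phrase_index` that maps each phrase to a sorted list of document ids, an ad-hoc sub-collection `subset` given as a sorted list of document ids, and an interestingness score defined as the fraction of a phrase's documents that fall inside the sub-collection. The score definition and all names are assumptions made for illustration only. The sketch applies a merge join to every phrase and performs no early termination or pruning.

```python
import heapq


def merge_join_count(postings, subset):
    """Count document ids that occur in both sorted lists (two-pointer merge join)."""
    i = j = count = 0
    while i < len(postings) and j < len(subset):
        if postings[i] == subset[j]:
            count += 1
            i += 1
            j += 1
        elif postings[i] < subset[j]:
            i += 1
        else:
            j += 1
    return count


def top_k_phrases(phrase_index, subset, k):
    """Return the k highest-scoring (score, phrase) pairs for the sub-collection.

    Assumed score: local document frequency divided by global document frequency.
    """
    scored = []
    for phrase, postings in phrase_index.items():
        local = merge_join_count(postings, subset)
        if local:  # skip phrases that never occur in the sub-collection
            scored.append((local / len(postings), phrase))
    return heapq.nlargest(k, scored)


# Hypothetical toy data: phrase -> sorted ids of documents containing it.
phrase_index = {
    "query processing": [1, 2, 3, 5],
    "top k": [2, 5],
    "data mining": [1, 4, 6, 7],
}
subset = [2, 5, 6]  # ad-hoc sub-collection, e.g. the result of a user query
result = top_k_phrases(phrase_index, subset, 2)
print(result)
```

This baseline scans the posting list of every phrase for each query, which is the kind of cost that index compaction, early termination, and search space pruning aim to reduce.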
Index Terms
- Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing