DOI: 10.1145/2247596.2247628
research-article

Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing

Published: 27 March 2012

ABSTRACT

In this paper we consider the problem of mining frequently occurring interesting phrases in large document collections in an ad-hoc fashion. Ad-hoc refers to the ability to perform such analyses over text corpora that can be an arbitrary subset of a global set of documents. Most of the time, the identification of these ad-hoc document collections is driven by a user- or application-defined query, with the aim of gathering statistics that describe the sub-collection as a starting point for further data analysis tasks. Our approach to mining the top-k most interesting phrases consists of a novel indexing technique, called Sequence Pattern Indexing (SeqPattIndex), that benefits from the observation that phrases often overlap sequentially. We devise a forest-based index for phrases and a further improved version with additional redundancy elimination. The actual top-k phrase mining algorithm operating on these indices combines a simple merge join with ideas from the pattern-growth framework of the data mining community, and uses early termination and search-space pruning techniques to improve runtime performance. Overall, our approach achieves, on average, lower index space consumption as well as lower runtime for the top-k phrase mining task, as we demonstrate in the experimental evaluation using real-world data.
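To make the problem setting concrete, the following is a minimal baseline sketch in Python. It is not the SeqPattIndex structure described in the abstract. It assumes a plain inverted index from each phrase to a sorted list of document ids, counts how many documents of the ad-hoc subset contain each phrase with a linear merge join, and ranks phrases by an assumed interestingness score (document frequency in the subset divided by document frequency in the global collection). The abstract does not define the interestingness measure, so the score, the function names, and the sample data are all illustrative assumptions, and the sketch omits the early termination and pruning the paper relies on.

```python
import heapq
from typing import Dict, List, Tuple


def merge_join_count(postings: List[int], subset: List[int]) -> int:
    """Count doc ids that occur in both sorted lists using a linear merge join."""
    i = j = count = 0
    while i < len(postings) and j < len(subset):
        if postings[i] == subset[j]:
            count += 1
            i += 1
            j += 1
        elif postings[i] < subset[j]:
            i += 1
        else:
            j += 1
    return count


def top_k_phrases(
    index: Dict[str, List[int]], subset: List[int], k: int
) -> List[Tuple[str, float]]:
    """Return the k phrases with the highest assumed interestingness score.

    Assumption: score = (docs in subset containing the phrase)
                        / (docs in the global collection containing the phrase).
    """
    sorted_subset = sorted(set(subset))
    scored = []
    for phrase, postings in index.items():
        local = merge_join_count(postings, sorted_subset)
        if local == 0:
            continue  # phrase does not occur in the ad-hoc collection
        scored.append((local / len(postings), phrase))
    # Highest score first; ties are broken alphabetically by phrase.
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(phrase, score) for score, phrase in best]


# Hypothetical toy index: phrase -> sorted ids of documents containing it.
toy_index = {
    "data mining": [1, 2, 3, 4],
    "top k": [2, 3],
    "merge join": [3, 5, 6],
    "query processing": [1, 5, 6, 7],
}
# Hypothetical ad-hoc collection, e.g. the result of a user query.
ad_hoc_docs = [2, 3, 5]

if __name__ == "__main__":
    for phrase, score in top_k_phrases(toy_index, ad_hoc_docs, 2):
        print(phrase, round(score, 3))
```

This baseline scans every phrase in the index for each query, which is the cost that a shared index over sequentially overlapping phrases, early termination, and search-space pruning are meant to reduce.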


    Published in

      EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
      March 2012
      643 pages
      ISBN: 9781450307901
      DOI: 10.1145/2247596

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 7 of 10 submissions, 70%
