DOI: 10.1145/1401890.1401941
research-article

Effective and efficient itemset pattern summarization: regression-based approaches

Published: 24 August 2008

ABSTRACT

In this paper, we propose a set of novel regression-based approaches to summarize frequent itemset patterns effectively and efficiently. Specifically, we show that the problem of minimizing the restoration error for a set of itemsets based on a probabilistic model corresponds to a nonlinear regression problem. We show that, under certain conditions, the nonlinear regression problem can be transformed into a linear regression problem. We propose two new methods, k-regression and tree-regression, to partition the entire collection of frequent itemsets in order to minimize the restoration error. The k-regression approach, employing a K-means type clustering method, guarantees that the total restoration error achieves a local minimum. The tree-regression approach employs a decision-tree type of top-down partition process. In addition, we discuss alternatives for estimating the frequency of the collection of itemsets covered by the k representative itemsets. The experimental evaluation on both real and synthetic datasets demonstrates that our approaches significantly improve summarization performance in terms of both accuracy (restoration error) and computational cost.
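
The abstract does not spell out the k-regression procedure. As a rough illustration of the K-means type alternation it describes, the sketch below uses generic clusterwise least squares on (x, y) pairs: fit one line per group, then reassign each point to the line with the smallest squared error. The function names `fit_line`, `sq_err`, and `k_regression`, and the use of one-dimensional points in place of itemsets, are assumptions made for illustration and are not the paper's formulation.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit y = a*x + b for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # All x values coincide: a horizontal line through the mean is optimal.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def sq_err(model, point):
    """Squared residual of one point under a line model (a, b)."""
    a, b = model
    x, y = point
    return (y - (a * x + b)) ** 2


def k_regression(points, k, iters=50, seed=0):
    """Alternate between fitting one line per group and reassigning points.

    Hypothetical illustration only. Returns the k line models, the final
    group label of each point, and the total squared error per iteration.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]  # random initial partition
    models = [(0.0, 0.0)] * k
    history = []
    for _ in range(iters):
        # Fit step: refit each non-empty group by least squares.
        for j in range(k):
            group = [p for p, lab in zip(points, labels) if lab == j]
            if group:
                models[j] = fit_line(group)
        # Assign step: move each point to the model that fits it best.
        new_labels = [
            min(range(k), key=lambda j: sq_err(models[j], p)) for p in points
        ]
        history.append(
            sum(sq_err(models[lab], p) for p, lab in zip(points, new_labels))
        )
        if new_labels == labels:
            break  # no point changed group: a local minimum has been reached
        labels = new_labels
    return models, labels, history


if __name__ == "__main__":
    # Toy data drawn from two different lines.
    data = [(float(x), 2.0 * x) for x in range(10)]
    data += [(float(x), 10.0 - x) for x in range(10)]
    models, labels, history = k_regression(data, k=2)
    print(models)
    print(history)
```

Neither step can increase the total squared error: refitting is optimal for a fixed partition, and reassignment is optimal for fixed models. This is the same argument that gives K-means its local-minimum guarantee, and like K-means the result depends on the initial partition.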


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
