ABSTRACT
In this paper, we propose a set of novel regression-based approaches to summarize frequent itemset patterns effectively and efficiently. Specifically, we show that the problem of minimizing the restoration error for a set of itemsets based on a probabilistic model corresponds to a nonlinear regression problem. We show that, under certain conditions, the nonlinear regression problem can be transformed into a linear regression problem. We propose two new methods, k-regression and tree-regression, which partition the entire collection of frequent itemsets in order to minimize the restoration error. The k-regression approach employs a k-means type clustering method and guarantees that the total restoration error reaches a local minimum. The tree-regression approach employs a decision-tree type top-down partitioning process. In addition, we discuss alternatives for estimating the frequencies of the itemsets covered by the k representative itemsets. The experimental evaluation on both real and synthetic datasets demonstrates that our approaches significantly improve summarization performance in terms of both accuracy (restoration error) and computational cost.
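The k-means type alternation behind k-regression can be sketched as follows. This is a minimal illustration rather than the paper's implementation. It assumes, purely for the example, an independence model in which an itemset's frequency is the product of per-item probabilities, so that the log-frequency is linear in the item indicators. The function name `k_regression` and its parameters are hypothetical.

```python
import numpy as np

def k_regression(X, y, k, n_iter=20, seed=0):
    """K-means style partitioning with one linear model per group.

    X : (n, d) 0/1 matrix, X[i, j] = 1 if itemset i contains item j.
    y : (n,) log-frequencies of the itemsets.
    ASSUMPTION (not from the abstract): log-frequency is modeled as a
    linear function of the item indicators within each group.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    labels = rng.integers(0, k, size=n)   # random initial partition
    coefs = np.zeros((k, d))
    history = []                          # total squared error per round
    for _ in range(n_iter):
        # Refit step: least-squares fit of each group's linear model.
        for c in range(k):
            mask = labels == c
            if mask.any():
                coefs[c] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        # Assignment step: move each itemset to the model that restores
        # its log-frequency with the smallest squared error.
        errors = (X @ coefs.T - y[:, None]) ** 2      # shape (n, k)
        new_labels = errors.argmin(axis=1)
        history.append(errors[np.arange(n), new_labels].sum())
        if np.array_equal(new_labels, labels):
            break                                     # partition is stable
        labels = new_labels
    return labels, coefs, history
```

Neither step can increase the total squared error, so the recorded error is non-increasing and the procedure stops at a local minimum, which mirrors the guarantee stated for k-regression. Calling `k_regression(X, y, k=2)` on an indicator matrix and a vector of log-frequencies returns the group labels, the per-group coefficients, and the error history.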
Index Terms
- Effective and efficient itemset pattern summarization: regression-based approaches