ABSTRACT
Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the numbers of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computational savings from pruning are independent of, or improve with, an increase in the number of items in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
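The filter-and-refine idea can be illustrated with a minimal Python sketch. The abstract does not state the bound itself; the form used here is an assumption obtained by substituting supp(AB) ≤ min(supp(A), supp(B)) into Pearson's correlation (the φ coefficient) for binary variables. The function names and the tid-set representation are hypothetical choices for illustration, not the authors' implementation.

```python
from math import sqrt


def phi(supp_a, supp_b, supp_ab):
    """Pearson's correlation (phi coefficient) of two binary items."""
    denom = sqrt(supp_a * supp_b * (1 - supp_a) * (1 - supp_b))
    return (supp_ab - supp_a * supp_b) / denom


def phi_upper_bound(supp_a, supp_b):
    """Support-only bound, assuming supp(AB) <= min(supp(A), supp(B))."""
    hi, lo = max(supp_a, supp_b), min(supp_a, supp_b)
    return sqrt(lo / hi) * sqrt((1 - hi) / (1 - lo))


def strong_pairs(transactions, theta):
    """Return (a, b, phi) for every item pair with phi >= theta."""
    n = len(transactions)
    tids = {}  # item -> set of ids of the transactions containing it
    for t, basket in enumerate(transactions):
        for item in basket:
            tids.setdefault(item, set()).add(t)
    # Items present in every transaction have zero variance; skip them.
    # Sort by support, highest first, so the bound shrinks along each row.
    items = sorted((i for i in tids if len(tids[i]) < n),
                   key=lambda i: (-len(tids[i]), i))
    result = []
    for x, a in enumerate(items):
        supp_a = len(tids[a]) / n
        for b in items[x + 1:]:
            supp_b = len(tids[b]) / n
            # Filter: for fixed a the bound only decreases as supp(b) drops,
            # so every later b can be skipped without computing its bound.
            if phi_upper_bound(supp_a, supp_b) < theta:
                break
            # Refine: compute the exact correlation for surviving pairs.
            supp_ab = len(tids[a] & tids[b]) / n
            r = phi(supp_a, supp_b, supp_ab)
            if r >= theta:
                result.append((a, b, r))
    return result
```

Sorting items by descending support is what lets the inner loop stop early: once the bound for one pair falls below θ, every remaining partner has equal or lower support and therefore an equal or lower bound.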
Index Terms
- Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs