Article

Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs

Authors:
Hui Xiong (University of Minnesota), Shashi Shekhar (University of Minnesota), Pang-Ning Tan (Michigan State University), Vipin Kumar (University of Minnesota)

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 2004, Pages 334–343, https://doi.org/10.1145/1014052.1014090

Published: 22 August 2004


ABSTRACT

Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the numbers of items and transactions are large, the computational cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property that allows many item pairs to be pruned without even computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that, in data sets with common Zipf or linear rank-support distributions, the computational savings from pruning either are independent of the number of items or improve as the number of items increases. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
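The filter-and-refine idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' TAPER implementation. It assumes the standard phi-coefficient form of Pearson's correlation for binary items and uses the fact that the joint support of a pair can never exceed the smaller of the two item supports. Under that assumption, for supp(A) >= supp(B), the correlation is at most sqrt(supp(B)/supp(A)) * sqrt((1 - supp(A))/(1 - supp(B))). The function names (`phi`, `upper_bound`, `strong_pairs`), the in-memory transaction-id sets, and the use of `>=` for the threshold test are choices made for this illustration only.

```python
from math import sqrt


def phi(n, n_a, n_b, n_ab):
    """Pearson's correlation (phi coefficient) of two binary items, from counts."""
    denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    return (n * n_ab - n_a * n_b) / denom if denom else 0.0


def upper_bound(n, n_a, n_b):
    """Bound on phi that needs only the two item supports (n_ab <= min(n_a, n_b))."""
    hi, lo = max(n_a, n_b), min(n_a, n_b)
    if lo == 0 or hi == n:
        return 0.0  # degenerate item: phi is undefined, treated as 0 here
    return sqrt(lo / hi) * sqrt((n - hi) / (n - lo))


def strong_pairs(transactions, theta):
    """Return (a, b, phi) for every item pair whose correlation is >= theta."""
    n = len(transactions)
    tidsets = {}
    for tid, basket in enumerate(transactions):
        for item in set(basket):
            tidsets.setdefault(item, set()).add(tid)

    # Sort items by support, highest first (ties broken by item name).
    items = sorted(tidsets, key=lambda i: (-len(tidsets[i]), i))
    result = []
    for i, a in enumerate(items):
        n_a = len(tidsets[a])
        for b in items[i + 1:]:
            n_b = len(tidsets[b])
            # Filter step: for a fixed a, the bound only shrinks as supp(b)
            # shrinks, so every remaining b can be skipped without being examined.
            if upper_bound(n, n_a, n_b) < theta:
                break
            # Refine step: compute the exact correlation for surviving pairs.
            r = phi(n, n_a, n_b, len(tidsets[a] & tidsets[b]))
            if r >= theta:
                result.append((a, b, r))
    return result


baskets = [["a", "b"], ["a", "b"], ["a", "b", "c"], ["c"], ["d"], ["a"]]
for a, b, r in strong_pairs(baskets, 0.7):
    print(a, b, round(r, 3))  # the only pair returned is ('a', 'b'), phi about 0.707
```

The early `break` is a simplified stand-in for the monotone pruning mentioned in the abstract: with items sorted by decreasing support, the bound for a fixed first item is non-increasing along the list, so the first failure rules out all later partners.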


Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining



Published in
KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2004
874 pages
ISBN: 1581138881
DOI: 10.1145/1014052
General Chairs: Won Kim (Cyber Database Solutions), Ronny Kohavi (Amazon.com)
Program Chairs: Johannes Gehrke (Cornell University), William DuMouchel (AT&T Labs Research)
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Pearson's correlation coefficient
statistical computing
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
