DOI: 10.1145/1014052.1014090
Article

Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs

Published: 22 August 2004

ABSTRACT

Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the numbers of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient but also exhibits a special monotone property that allows many item pairs to be pruned without even computing their upper bounds. A Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning are independent of, or improve with, an increase in the number of items in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.
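
The filter-and-refine idea can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not state the bound's formula. The sketch assumes the standard form of Pearson's correlation for binary variables (the φ coefficient), written in terms of item supports, and the bound that follows from supp(A,B) ≤ min(supp(A), supp(B)): with supp(A) ≥ supp(B), φ ≤ sqrt(supp(B)/supp(A)) · sqrt((1 − supp(A))/(1 − supp(B))). The function names (`phi`, `phi_upper_bound`, `strong_pairs`) and the toy `baskets` data are hypothetical.

```python
from math import sqrt


def phi(supp_a, supp_b, supp_ab):
    """Pearson correlation of two binary items, expressed through supports."""
    denom = sqrt(supp_a * supp_b * (1 - supp_a) * (1 - supp_b))
    return (supp_ab - supp_a * supp_b) / denom


def phi_upper_bound(supp_a, supp_b):
    """Bound on phi that needs only the two single-item supports.

    Assumed form: it follows from supp(A,B) <= min(supp(A), supp(B)).
    """
    hi, lo = max(supp_a, supp_b), min(supp_a, supp_b)
    return sqrt(lo / hi) * sqrt((1 - hi) / (1 - lo))


def strong_pairs(transactions, theta):
    """Return (item_a, item_b, phi) for every pair with phi above theta."""
    n = len(transactions)
    # Map each item to the set of transaction ids that contain it.
    tids = {}
    for tid, basket in enumerate(transactions):
        for item in set(basket):
            tids.setdefault(item, set()).add(tid)
    # Items present in every transaction have undefined phi; skip them.
    # Sort by decreasing support (ties broken by name for determinism).
    items = sorted(
        (i for i in tids if len(tids[i]) < n),
        key=lambda i: (-len(tids[i]), str(i)),
    )
    result = []
    for idx, a in enumerate(items):
        supp_a = len(tids[a]) / n
        for b in items[idx + 1:]:
            supp_b = len(tids[b]) / n
            # Filter step: for a fixed A the bound shrinks as supp(B) shrinks,
            # so every remaining (lower-support) B can be skipped as well.
            if phi_upper_bound(supp_a, supp_b) < theta:
                break
            # Refine step: compute the exact correlation for surviving pairs.
            supp_ab = len(tids[a] & tids[b]) / n
            value = phi(supp_a, supp_b, supp_ab)
            if value > theta:
                result.append((a, b, value))
    return result


# Hypothetical toy market basket data, for illustration only.
baskets = [
    {"a", "b"}, {"a", "b"}, {"a", "b"}, {"a"},
    {"c"}, {"c", "d"}, {"d"}, {"e"},
]
pairs = strong_pairs(baskets, theta=0.7)
```

On this toy data with θ = 0.7, the only pair returned is (a, b), with φ ≈ 0.775. Here the exact correlation equals the bound because every basket containing b also contains a. The inner `break` is where the monotone property pays off: pairs past that point are discarded without evaluating their bounds.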


    • Published in

      KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2004
      874 pages
      ISBN:1581138881
      DOI:10.1145/1014052

      Copyright © 2004 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
