DOI: 10.1145/3292500.3330978
research-article

SPuManTE: Significant Pattern Mining with Unconditional Testing

Published: 25 July 2019

ABSTRACT

We present SPuManTE, an efficient algorithm for mining significant patterns from a transactional dataset. SPuManTE controls the Family-wise Error Rate: it ensures that the probability of reporting one or more false discoveries is less than a user-specified threshold. A key ingredient of SPuManTE is UT, our novel unconditional statistical test for evaluating the significance of a pattern, which requires fewer assumptions on the data generation process and is more appropriate for a knowledge discovery setting than classical conditional tests, such as the widely used Fisher's exact test. Computational requirements have limited the use of unconditional tests in significant pattern discovery, but UT overcomes this issue by obtaining the required probabilities in a novel, efficient way. SPuManTE combines UT with recent results on the supremum of the deviations of pattern frequencies from their expectations, founded in statistical learning theory. This combination allows SPuManTE to be very efficient while also enjoying high statistical power. The results of our experimental evaluation show that SPuManTE allows the discovery of statistically significant patterns while properly accounting for uncertainties in patterns' frequencies due to the data generation process.
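
For readers unfamiliar with the classical baseline the abstract contrasts against, the following is a minimal sketch of a one-sided Fisher's exact test on a 2×2 contingency table, combined with a Bonferroni correction as one simple way to bound the Family-wise Error Rate. It is not UT and not SPuManTE; the function names, the example tables, and the choice of Bonferroni are assumptions made purely for illustration.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Conditional test: all row and column totals are treated as fixed, and
    the p-value is the hypergeometric probability of observing at least
    `a` in the top-left cell.
    """
    row1 = a + b        # e.g. transactions that contain the pattern
    col1 = a + c        # e.g. transactions that carry the target label
    n = a + b + c + d   # all transactions
    upper = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)


def bonferroni_significant(p_values, alpha=0.05):
    """Return the names whose p-value passes the Bonferroni threshold alpha / m.

    Testing m hypotheses at level alpha / m keeps the probability of one or
    more false discoveries at most alpha.
    """
    m = len(p_values)
    return [name for name, p in p_values.items() if p <= alpha / m]


# Hypothetical contingency tables for two candidate patterns.
p_values = {
    "pattern_A": fisher_one_sided(4, 0, 0, 4),
    "pattern_B": fisher_one_sided(3, 1, 1, 3),
}
significant = bonferroni_significant(p_values, alpha=0.05)
```

Because the test conditions on both margins, it assumes that the pattern's observed frequency is fixed by design, which is the assumption the abstract describes unconditional tests as avoiding.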

        Published in

        KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
        July 2019
        3305 pages
        ISBN: 9781450362016
        DOI: 10.1145/3292500

        Copyright © 2019 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        KDD '19 Paper Acceptance Rate: 110 of 1,200 submissions, 9%
        Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
