research-article

SPuManTE: Significant Pattern Mining with Unconditional Testing

Authors:
Leonardo Pellegrina, Università di Padova, Padova, Italy
Matteo Riondato, Amherst College, Amherst, MA, USA
Fabio Vandin, Università di Padova, Padova, Italy

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2019, Pages 1528–1538, https://doi.org/10.1145/3292500.3330978

Published: 25 July 2019


ABSTRACT

We present SPuManTE, an efficient algorithm for mining significant patterns from a transactional dataset. SPuManTE controls the Family-wise Error Rate: it ensures that the probability of reporting one or more false discoveries is less than a user-specified threshold. A key ingredient of SPuManTE is UT, our novel unconditional statistical test for evaluating the significance of a pattern, which requires fewer assumptions on the data generation process and is more appropriate for a knowledge discovery setting than classical conditional tests, such as the widely used Fisher's exact test. Computational requirements have limited the use of unconditional tests in significant pattern discovery, but UT overcomes this issue by obtaining the required probabilities in a novel, efficient way. SPuManTE combines UT with recent results on the supremum of the deviations of pattern frequencies from their expectations, founded in statistical learning theory. This combination allows SPuManTE to be very efficient while also enjoying high statistical power. The results of our experimental evaluation show that SPuManTE allows the discovery of statistically significant patterns while properly accounting for uncertainties in patterns' frequencies due to the data generation process.
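For orientation, the classical conditional baseline that the abstract contrasts with UT can be sketched in a few lines. The code below is not SPuManTE and not UT: it is a minimal illustration of a one-sided Fisher's exact test on a 2×2 contingency table, followed by a plain Bonferroni correction as one simple way to keep the Family-wise Error Rate at or below a threshold. The table layout (pattern present or absent against a binary label), the function names, and the example counts are all assumptions made for illustration.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d):
    """Upper-tail p-value of Fisher's exact test for the table [[a, b], [c, d]].

    Assumed layout (hypothetical): rows are pattern present / absent,
    columns are label 1 / label 0. The test conditions on both margins,
    so under the null the count `a` is hypergeometric; we sum P(X >= a).
    """
    n = a + b + c + d
    row1 = a + b  # transactions containing the pattern
    col1 = a + c  # transactions with label 1
    upper = min(row1, col1)
    tail = sum(
        comb(col1, x) * comb(n - col1, row1 - x) for x in range(a, upper + 1)
    )
    return tail / comb(n, row1)


def bonferroni_significant(p_values, delta):
    """Return the patterns whose p-value passes the Bonferroni threshold.

    Testing each of m hypotheses at level delta / m bounds the probability
    of one or more false discoveries (the FWER) by delta.
    """
    threshold = delta / len(p_values)
    return [name for name, p in p_values.items() if p <= threshold]


# Hypothetical counts for two candidate patterns.
tables = {
    "A": (8, 0, 0, 8),  # pattern perfectly aligned with the label
    "B": (4, 4, 4, 4),  # pattern independent of the label
}
p_values = {name: fisher_exact_one_sided(*t) for name, t in tables.items()}
significant = bonferroni_significant(p_values, delta=0.05)
```

Because the hypergeometric null fixes both margins of the table, this test assumes more about the data generation process than an unconditional test does, which is the limitation the abstract attributes to conditional tests.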


Index Terms

SPuManTE: Significant Pattern Mining with Unconditional Testing
1. Information systems
  1. Information systems applications
    1. Data mining
2. Mathematics of computing
  1. Probability and statistics
    1. Statistical paradigms
      1. Contingency table analysis

Published in
KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN: 9781450362016
DOI: 10.1145/3292500
General Chairs: Ankur Teredesai (KenSci), Vipin Kumar (University of Minnesota)
Program Chairs: Ying Li (EV Analysis Corporation), Rómer Rosales (LinkedIn), Evimaria Terzi (Boston University), George Karypis (University of Minnesota)
Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
family-wise error rate
hypothesis testing
itemset mining
Acceptance Rates
KDD '19 Paper Acceptance Rate: 110 of 1,200 submissions, 9%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%