ABSTRACT
We present SPuManTE, an efficient algorithm for mining significant patterns from a transactional dataset. SPuManTE controls the Family-wise Error Rate (FWER): it ensures that the probability of reporting one or more false discoveries is less than a user-specified threshold. A key ingredient of SPuManTE is UT, our novel unconditional statistical test for evaluating the significance of a pattern. UT requires fewer assumptions on the data generation process and is more appropriate for a knowledge discovery setting than classical conditional tests, such as the widely used Fisher's exact test. Computational requirements have so far limited the use of unconditional tests in significant pattern discovery; UT overcomes this issue by obtaining the required probabilities in a novel, efficient way. SPuManTE combines UT with recent results, grounded in statistical learning theory, on the supremum of the deviations of pattern frequencies from their expectations. This combination makes SPuManTE very efficient while retaining high statistical power. The results of our experimental evaluation show that SPuManTE discovers statistically significant patterns while properly accounting for the uncertainty in patterns' frequencies that is due to the data generation process.
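To make the contrast between conditional and unconditional testing concrete, the following is a minimal sketch and not the UT test or SPuManTE itself. It assumes a pattern that appears in `x1` of `n1` transactions with one class label and in `x0` of `n0` transactions with the other. `fisher_one_sided` is the classical conditional test, which fixes the total number of occurrences of the pattern. `unconditional_one_sided` is a simple Barnard-style test that treats both counts as binomial and maximizes over the unknown common frequency on a grid; the choice of test statistic, the grid search, and all function names are illustrative assumptions. `bonferroni` shows one standard way of controlling the FWER, again only as a baseline.

```python
from math import comb


def fisher_one_sided(x1, n1, x0, n0):
    """One-sided Fisher's exact test: conditions on the pattern's total support."""
    k = x1 + x0  # total occurrences of the pattern, held fixed by the test
    n = n1 + n0
    upper = min(k, n1)
    # Hypergeometric tail: tables with at least x1 occurrences in class 1.
    tail = sum(comb(n1, a) * comb(n0, k - a) for a in range(x1, upper + 1))
    return tail / comb(n, k)


def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def unconditional_one_sided(x1, n1, x0, n0, grid=200):
    """Barnard-style unconditional test (illustrative assumption, not UT).

    Statistic: difference of the observed frequencies in the two classes.
    The p-value is the largest tail probability over a grid of values of
    the unknown common frequency p (the nuisance parameter).
    """
    t_obs = x1 / n1 - x0 / n0
    # Tables at least as extreme as the observed one; margins are NOT fixed.
    extreme = [
        (a, b)
        for a in range(n1 + 1)
        for b in range(n0 + 1)
        if a / n1 - b / n0 >= t_obs - 1e-12
    ]
    worst = 0.0
    for i in range(1, grid):
        p = i / grid
        prob = sum(binom_pmf(a, n1, p) * binom_pmf(b, n0, p) for a, b in extreme)
        worst = max(worst, prob)
    return worst


def bonferroni(pvalues, alpha):
    """Report patterns whose p-value is at most alpha / (number of hypotheses)."""
    threshold = alpha / len(pvalues)
    return [name for name, p in pvalues.items() if p <= threshold]


if __name__ == "__main__":
    # Toy table: the pattern is in all 3 transactions of one class, none of the other.
    print(fisher_one_sided(3, 3, 0, 3))
    print(unconditional_one_sided(3, 3, 0, 3))
    print(bonferroni({"A": 0.001, "B": 0.03}, alpha=0.05))
```

On this toy table the conditional p-value is 0.05, while the grid-based unconditional p-value is 1/64. This single example should not be read as a general statement about the relative power of the two approaches. The nested enumeration over all tables also shows why naive unconditional tests become expensive on large datasets, which is the computational obstacle the abstract refers to.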
Index Terms
- SPuManTE: Significant Pattern Mining with Unconditional Testing