Abstract
The extraction of patterns displaying significant association with a class label is a key data mining task with wide application in many domains. We introduce and study a variant of the problem that requires mining the top-k statistically significant patterns, thus providing tight control on the number of patterns reported in output. We develop TopKWY, the first algorithm to mine the top-k significant patterns while rigorously controlling the family-wise error rate of the output, and provide theoretical evidence of its effectiveness. TopKWY crucially relies on a novel strategy to explore statistically significant patterns and on several key implementation choices, which may be of independent interest. Our extensive experimental evaluation shows that TopKWY enables the extraction of the most significant patterns from large datasets that could not be analyzed by state-of-the-art methods. In addition, TopKWY improves over the state of the art even for the extraction of all significant patterns.










References
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. SIGMOD Rec 22:207–216
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases (VLDB ’94), San Francisco, CA, USA. Morgan Kaufmann Publishers Inc, pp 487–499
Atzmueller M (2015) Subgroup discovery. Wiley Interdiscip Rev Data Min Knowl Discov 5(1):35–49
Bayardo RJ Jr (1998) Efficiently mining long patterns from databases. ACM Sigmod Rec 27(2):85–93
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol) 57:289–300
Bonferroni C (1936) Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8:3–62
Dong G, Bailey J (2012) Contrast data mining: concepts, algorithms, and applications. CRC Press, Boca Raton
Duivesteijn W, Knobbe A (2011) Exploiting false discoveries–statistical validation of patterns and quality measures in subgroup discovery. In: 2011 IEEE 11th international conference on data mining. IEEE, pp 151–160
Fisher RA (1922) On the interpretation of \(\chi^2\) from contingency tables, and the calculation of P. J R Stat Soc 85(1):87–94
Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data (TKDD) 1(3):14
Hämäläinen W (2012) Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowl Inf Syst 32(2):383–414
Hämäläinen W, Webb GI (2019) A tutorial on statistically sound pattern discovery. Data Min Knowl Disc 33(2):325–377
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Chen W, Naughton JF, Bernstein PA (eds) SIGMOD conference. ACM, New York, pp 1–12
Han J, Wang J, Lu Y, Tzvetkov P (2002) Mining top-k frequent closed patterns without minimum support. In: Proceedings 2002 IEEE international conference on data mining, 2002. ICDM 2003. IEEE, pp 211–218
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Mining Knowl Discov 15:55–86
Herrera F, Carmona CJ, González P, Del Jesus MJ (2011) An overview on subgroup discovery: foundations and applications. Knowl Inf Syst 29(3):495–525
Komiyama J, Ishihata M, Arimura H, Nishibayashi T, Minato S-I (2017) Statistical emerging pattern mining with multiple testing correction. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 897–906
Lehmann EL, Romano JP (2012) Generalizations of the familywise error rate. In: Selected works of EL Lehmann. Springer, pp 719–735
Li J, Liu J, Toivonen H, Satou K, Sun Y, Sun B (2014) Discovering statistically non-redundant subgroups. Knowl-Based Syst 67:315–327
Llinares-López F, Sugiyama M, Papaxanthos L, Borgwardt K (2015) Fast and memory-efficient significant pattern mining via permutation testing. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 725–734
Minato S-i, Uno T, Tsuda K, Terada A, Sese J (2014) A fast method of statistical assessment for combinatorial hypotheses based on frequent itemset enumeration. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 422–436
Nijssen S, Kok JN (2004) A quickstart in frequent structure mining can make a difference. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 647–652
Nijssen S, Kok JN (2006) Frequent subgraph miners: runtimes don’t say everything. In: MLG 2006, p 173
Papaxanthos L, Llinares-López F, Bodenham D, Borgwardt K (2016) Finding significant combinations of features in the presence of categorical covariates. In: Advances in neural information processing systems, pp 2279–2287
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: International conference on database theory. Springer, pp 398–416
Pietracaprina A, Vandin F (2007) Efficient incremental mining of top-K frequent closed itemsets. In: Discovery science, volume 4755 of lecture notes in computer science. Springer, Berlin Heidelberg, pp 275–280
Pellegrina L, Vandin F (2018) Efficient mining of the most significant patterns with permutation testing. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, pp 2070–2079
Pellegrina L, Riondato M, Vandin F (2019a) Hypothesis testing and statistically-sound pattern mining. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, pp 3215–3216
Pellegrina L, Riondato M, Vandin F (2019b) Spumante: Significant pattern mining with unconditional testing. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining-KDD, vol 19
Tarone R (1990) A modified Bonferroni method for discrete data. Biometrics 515–522
Terada A, Okada-Hatakeyama M, Tsuda K, Sese J (2013a) Statistical significance of combinatorial regulations. Proc Nat Acad Sci 110(32):12996–13001
Terada A, Tsuda K, Sese J (2013b) Fast Westfall-Young permutation procedure for combinatorial regulation discovery. In: 2013 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 153–158
Terada A, Kim H, Sese J (2015) High-speed Westfall-Young permutation procedure for genome-wide association studies. In: Proceedings of the 6th ACM conference on bioinformatics, computational biology and health informatics. ACM, pp 17–26
Terada A, Tsuda K et al (2016) Significant pattern mining with confounding variables. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 277–289
Uno T, Kiyomi M, Arimura H (2005) Lcm ver. 3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. ACM, pp 77–86
van der Laan MJ, Dudoit S, Pollard KS (2004) Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Stat Appl Genet Mol Biol 3(1):1–25
Webb GI (2006) Discovering significant rules. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 434–443
Webb GI (2007) Discovering significant patterns. Mach Learn 68(1):1–33
Webb GI (2008) Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Mach Learn 71(2–3):307–323
Westfall PH, Young SS (1993) Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley Series in Probability and Statistics, Hoboken
Wörlein M, Meinl T, Fischer I, Philippsen M (2005) A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In: European conference on principles of data mining and knowledge discovery. Springer, pp 392–403
Zandolin D, Pietracaprina A (2003) Mining frequent itemsets using patricia tries. In: Proceedings of FIMI03, vol 90
Acknowledgements
This work is supported, in part, by the National Science Foundation grant IIS-1247581 (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1247581), by the University of Padova grants SID2017 and STARS: Algorithms for Inferential Data Mining, and by MIUR, the Italian Ministry of Education, University and Research, under PRIN Project n. 20174LF3T8 AHeAD (Efficient Algorithms for HArnessing Networked Data).
Additional information
Responsible editor: M. J. Zaki.
A preliminary version of this work appeared in the proceedings of ACM KDD’18 as (Pellegrina and Vandin 2018).
Cite this article
Pellegrina, L., Vandin, F. Efficient mining of the most significant patterns with permutation testing. Data Min Knowl Disc 34, 1201–1234 (2020). https://doi.org/10.1007/s10618-020-00687-8
DOI: https://doi.org/10.1007/s10618-020-00687-8