ROhAN: Row-order agnostic null models for statistically-sound knowledge discovery

Abuissa, Maryam; Lee, Alexander; Riondato, Matteo

doi:10.1007/s10618-023-00938-4

Published: 06 May 2023

Volume 37, pages 1692–1718 (2023)

Data Mining and Knowledge Discovery

“Forth Eorlingas!”

— King Théoden of Rohan.

Abstract

We introduce a novel class of null models for the statistical validation of results obtained from binary transactional and sequence datasets. Our null models are Row-Order Agnostic (ROA), i.e., do not consider the order of rows in the observed dataset to be fixed, in stark contrast with previous null models, which are Row-Order Enforcing (ROE). We present ROhAN, an algorithmic framework for efficiently sampling datasets from ROA models according to user-specified distributions, which is a necessary step for the resampling-based statistical hypothesis tests employed to validate the results. ROhAN uses Metropolis-Hastings or rejection sampling to build on top of existing or future ROE sampling procedures. Our experimental evaluation shows that ROA models are very different from ROE ones, impacting the statistical validation, and that ROhAN is efficient, mixes fast, and scales well as the dataset grows.
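The distinction between the two families of models can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `roe_view` and `roa_view` are hypothetical, and the sketch only assumes that a ROE model treats a dataset as an ordered tuple of transactions while a ROA model treats it as a bag (multiset) of transactions.

```python
from collections import Counter
from itertools import permutations


def roe_view(transactions):
    """Row-order enforcing (assumed reading): an ordered tuple of transactions."""
    return tuple(frozenset(t) for t in transactions)


def roa_view(transactions):
    """Row-order agnostic (assumed reading): a bag of transactions."""
    return Counter(frozenset(t) for t in transactions)


# The same three transactions, listed in two different orders.
d1 = [{"a", "b"}, {"a"}, {"c"}]
d2 = [{"a"}, {"c"}, {"a", "b"}]

same_under_roe = roe_view(d1) == roe_view(d2)  # False: the order differs
same_under_roa = roa_view(d1) == roa_view(d2)  # True: the bags coincide

# Number of distinct ordered datasets that collapse to the same bag.
orderings_distinct_rows = len(set(permutations(roe_view(d1))))

# Repeated transactions reduce the number of distinct orderings.
d3 = [{"a"}, {"a"}, {"c"}]
orderings_repeated_rows = len(set(permutations(roe_view(d3))))
```

Because bags with repeated transactions correspond to fewer ordered datasets than bags whose transactions are all distinct, a sampler that is uniform over ordered datasets is in general not uniform over bags, which suggests why a correction step such as Metropolis-Hastings or rejection sampling is needed on top of a ROE sampler.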

Notes

Throughout this work, we use “significant” to mean “statistically significant”.
We drop “binary” and just use “transactional” in the rest of this work.
When considering the order of transactions as fixed, as ROE models do, there is a 1:1 correspondence between transactional datasets and binary matrices. The row sums correspond to the transaction lengths, and the column sums to the supports of single items.
Exact preservation of properties can be partially incorporated into these null models, but doing so usually makes it impossible to derive a closed form for \(\pi\), with relevant computational consequences. The same is also true for many complex in-expectation constraints (Cimini et al. 2019).
We do not indicate this fact in the notation, to keep it light.
Gionis et al. (2007) focus on the case where \(\pi\) is the uniform distribution, but extending their discussion to a generic \(\pi\) is straightforward.
If not even earlier.
Some presentations of the algorithms mention a “transaction identifier” associated with each transaction, but this identifier is used only to uniquely label transactions, not to order the rows, and it is in part a leftover of the idea that a transactional dataset is stored in a table in a relational database.
We assume \(\binom{0}{0,\dotsc,0} = 1\).
Other definitions of Q are possible. For example, a tight lower bound \(b \le \min_{\mathcal{D} \in \mathcal{Z}_{\textrm{A}}} \textsf{c}(\mathcal{D})\) could be used to define \(Q \doteq \left|\mathcal{Z}_{\textrm{E}}\right| / \left(\left|\mathcal{Z}_{\textrm{A}}\right| b\right)\), which would lead to more samples being accepted. We leave the derivation of such a bound to future work.
https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
The other parameters of the generator were left to their default values.
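
The notes on binary matrices and on Q can be made concrete with a toy sketch. It rests on assumptions that this excerpt does not confirm: \(\textsf{c}(\mathcal{D})\) is read as the number of ROE datasets that collapse to the ROA dataset \(\mathcal{D}\), \(\mathcal{Z}_{\textrm{E}}\) is taken to be the set of binary matrices with the observed row and column sums, and a brute-force enumeration stands in for a real ROE sampler. All function names are hypothetical.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import product


def margins(matrix):
    """Return (row sums, column sums): transaction lengths and item supports."""
    return tuple(sum(r) for r in matrix), tuple(sum(c) for c in zip(*matrix))


def roe_space(observed):
    """Toy stand-in for Z_E: every binary matrix with the observed margins."""
    n, m = len(observed), len(observed[0])
    target = margins(observed)
    space = []
    for bits in product((0, 1), repeat=n * m):
        cand = tuple(tuple(bits[i * m:(i + 1) * m]) for i in range(n))
        if margins(cand) == target:
            space.append(cand)
    return space


def to_roa(matrix):
    """Forget the row order: sorted rows are a canonical form of the bag."""
    return tuple(sorted(matrix))


def sample_roa(space, c, b, rng):
    """Rejection sampling: propose uniformly from Z_E, accept with prob. b / c(D)."""
    while True:
        dataset = to_roa(rng.choice(space))
        if rng.random() < b / c[dataset]:
            return dataset


observed = ((1, 0, 0), (0, 0, 1), (1, 1, 0))
space = roe_space(observed)
c = Counter(to_roa(mat) for mat in space)  # assumed meaning of c(D)
b = 1                                      # trivial lower bound on c(D)
Q = Fraction(len(space), len(c) * b)       # |Z_E| / (|Z_A| b)

# Exact check: proposal probability times acceptance probability is the
# same for every ROA dataset, so accepted samples are uniform over Z_A.
accepted_mass = {d: Fraction(c[d], len(space)) * Fraction(b, c[d]) for d in c}
acceptance_rate = sum(accepted_mass.values())  # equals 1 / Q

rng = random.Random(0)
draws = Counter(sample_roa(space, c, b, rng) for _ in range(3000))
```

In this toy instance the five matrices in the stand-in for \(\mathcal{Z}_{\textrm{E}}\) collapse to three ROA datasets, so a uniform ROE sampler alone would favor the two datasets whose rows are all distinct. Under the stated assumptions the overall acceptance rate is \(1/Q\), which is why a larger valid \(b\) would lead to more samples being accepted.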

References

Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proc. 20th Int. Conf. Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, VLDB ’94, pp 487–499
Besag J, Clifford P (1989) Generalized Monte Carlo significance tests. Biometrika 76(4):633–642
Casella G, Robert CP, Wells MT (2004) Generalized accept-reject sampling schemes. In: A Festschrift for Herman Rubin, IMS Lecture Notes - Monograph Series, vol 45. IMS, pp 342–347
Chen Y, Diaconis P, Holmes SP et al. (2005) Sequential Monte Carlo methods for statistical analysis of tables. J Am Stat Assoc 100(469):109–120
Cimini G, Squartini T, Saracco F et al. (2019) The statistical physics of real-world networks. Nature Rev Phys 1(1):58–71
Connor EF, Simberloff D (1979) The assembly of species communities: chance or competition? Ecology 60(6):1132–1140
Dalleiger S, Vreeken J (2022) Discovering significant patterns under sequential false discovery control. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, KDD ’22
De Bie T (2010) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Disc 23(3):407–446. https://doi.org/10.1007/s10618-010-0209-3
Ferkingstad E, Holden L, Sandve GK (2015) Monte Carlo null models for genomic data. Stat Sci 30(1):59–71
Fout AM (2022) New methods for fixed-margin binary matrix sampling, Fréchet covariance, and MANOVA tests for random objects in multiple metric spaces. PhD thesis, Colorado State University
Gionis A, Mannila H, Mielikäinen T et al. (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Dis from Data (TKDD) 1(3):14
Gwadera R, Crestani F (2010) Ranking sequential patterns with respect to significance. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp 286–299
Hämäläinen W, Webb GI (2019) A tutorial on statistically sound pattern discovery. Data Min Knowl Disc 33(2):325–377
Hrovat G, Fister I Jr, Yermak K, et al. (2015) Interestingness measure for mining sequential patterns in sports. Journal of Intelligent & Fuzzy Systems 29(5):1981–1994
Jenkins S, Walzer-Goldfeld S, Riondato M (2022) SPEck: mining statistically-significant sequential patterns efficiently with exact sampling. Data Min Knowl Disc 36(4):1575–1599
Lehmann EL, Romano JP (2022) Testing Statistical Hypotheses, 4th edn. Springer, Berlin
Low-Kam C, Raïssi C, Kaytoue M, et al. (2013) Mining statistically significant sequential patterns. In: 2013 IEEE 13th International Conference on Data Mining, IEEE, pp 488–497
Méger N, Rigotti C, Pothier C (2015) Swap randomization of bases of sequences for mining satellite image times series. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 190–205
Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, KDD ’98, pp 274–278
Mitzenmacher M, Upfal E (2005) Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press
Ojala M (2010) Assessing data mining results on matrices with randomization. In: 2010 IEEE International Conference on Data Mining, pp 959–964, https://doi.org/10.1109/ICDM.2010.20
Ojala M, Vuokko N, Kallio A, et al. (2008) Randomization of real-valued matrices for assessing the significance of data mining results. In: Proceedings of the 2008 SIAM International Conference on Data Mining, SDM ’08, pp 494–505, https://doi.org/10.1137/1.9781611972788.45
Ojala M, Garriga GC, Gionis A, et al. (2010) Evaluating query result significance in databases via randomizations. In: Proceedings of the 2010 SIAM International Conference on Data Mining (SDM), pp 906–917, https://doi.org/10.1137/1.9781611972801.79
Pei J, Han J, Mortazavi-Asl B et al. (2004) Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440
Pellegrina L, Riondato M, Vandin F (2019) Hypothesis testing and statistically-sound pattern mining. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, New York, NY, USA, KDD ’19, pp 3215–3216, https://doi.org/10.1145/3292500.3332286
Pinxteren S, Calders T (2021) Efficient permutation testing for significant sequential patterns. In: Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), SIAM, pp 19–27
Preti G, De Francisci Morales G, Riondato M (2022) Alice and the caterpillar: A more descriptive null model for assessing data mining results. In: Proceedings of the 22nd IEEE International Conference on Data Mining, pp 418–427
Ryser HJ (1963) Combinatorial Mathematics. American Mathematical Society, USA
Stanley RP (2011) Enumerative Combinatorics, vol 1, 2nd edn. Cambridge University Press
Tonon A, Vandin F (2019) Permutation strategies for mining significant sequential patterns. In: 2019 IEEE International Conference on Data Mining (ICDM), IEEE, pp 1330–1335
Vreeken J, Tatti N (2014) Interesting patterns. In: Frequent pattern mining. Springer, pp 105–134
Wang G (2020) A fast MCMC algorithm for the uniform sampling of binary matrices with fixed margins. Electron J Statistics 14(1):1690–1706
Westfall PH, Young SS (1993) Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons
Zimmermann A (2014) The data problem in data mining. SIGKDD Explor 16(2):38–45

Acknowledgements

This work is supported in part by NSF award IIS-2006765.

Author information

Authors and Affiliations

Department of Computer Science, Amherst College, Box #2232, Amherst College, Amherst, MA, 01002, USA
Maryam Abuissa, Alexander Lee & Matteo Riondato

Corresponding author

Correspondence to Matteo Riondato.

Additional information

Responsible editors: Charalampos E. Tsourakakis, Tania Cerquitelli, Marcello Restelli, Fabio Vitale.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Abuissa, M., Lee, A. & Riondato, M. ROhAN: Row-order agnostic null models for statistically-sound knowledge discovery. Data Min Knowl Disc 37, 1692–1718 (2023). https://doi.org/10.1007/s10618-023-00938-4

Received: 28 November 2022
Accepted: 06 April 2023
Published: 06 May 2023
Issue Date: July 2023
DOI: https://doi.org/10.1007/s10618-023-00938-4
