Sequential pattern sampling with norm-based utility

Diop, Lamine; Diop, Cheikh Talibouya; Giacometti, Arnaud; Li, Dominique; Soulet, Arnaud

doi:10.1007/s10115-019-01417-3

Sequential pattern sampling with norm-based utility

Regular Paper
Published: 26 October 2019

Volume 62, pages 2029–2065, (2020)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Lamine Diop ORCID: orcid.org/0000-0003-4539-2549^1,2,
Cheikh Talibouya Diop¹,
Arnaud Giacometti²,
Dominique Li² &
…
Arnaud Soulet²

375 Accesses
7 Citations
Explore all metrics

Abstract

Sequential pattern mining has been introduced by Agrawal and Srikant (in: Proceedings of ICDE’95, pp 3–14, 1995) 2 decades ago, and its usefulness has been widely proved for different mining tasks and application fields such as web usage mining, text mining, bioinformatics, fraud detection and so on. Since 1995, despite numerous optimization proposals, sequential pattern mining remains a costly task that often generates too many patterns. This limit, also reached by itemset mining, was circumvented by pattern sampling. Pattern sampling is a non-exhaustive method for instantly discovering relevant patterns that ensures a good interactivity while providing strong statistical guarantees due to its random nature. Curiously, such an approach investigated for different kinds of patterns including itemsets and subgraphs has not yet been applied to sequential patterns. In this paper, we propose the first method dedicated to sequential pattern sampling. In addition to address sequential data, the originality of our approach is to introduce a class of interestingness measures relying on the norm of the sequence, named norm-based utilities. In particular, it enables to add constraints on the norm of sampled patterns to control the length of the drawn patterns and to avoid the pitfall of the “long tail” where the rarest patterns flood the user. We propose a new two-step random procedure integrating this class of measures, named \({\textsc {NUSSampling}}\) that randomly draws sequential patterns according to frequency weighted by a norm-based utility. We demonstrate that this method performs an exact sampling according to the underlying measure. Moreover, despite the use of rejection sampling, the experimental study shows that \({\textsc {NUSSampling}}\) remains efficient. We especially focus on the interest of norm constraints and exponential decays that help to draw general patterns of the “head”. We also illustrate how to benefit from these sampled patterns to instantly build an associative classifier dedicated to sequences. This classification approach rivals state-of-the-art proposals showing the interest of sequential pattern sampling with norm-based utility.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Article 12 April 2024

Stratified random sampling from streaming and stored data

Article 23 October 2020

Time-Dependent Graphs: Definitions, Applications, and Algorithms

Article Open access 25 September 2019

Notes

By convention, we consider that \(\left( {\begin{array}{c}n\\ p\end{array}}\right) = 0\) if \(p > n\).
The datasets bms and sign are available at http://www.philippe-fournier-viger.com/spmf and other ones at http://www.mybytes.de/#data.
https://github.com/zakimjz/IBMGenerator.
The dataset reuters is available at ana.cachopo.org/datasets-for-single-label-text-categorization and other ones, at www.mybytes.de/#data.
We recall that for every sequence s, we define s[K] as: \(s[K]= \cap _{k \in K} s[k].\)

References

Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of ICDE’95, pp 3–14
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
Article MathSciNet Google Scholar
Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of EDBT vol 96, pp 3–17
Zaki MJ (2001) SPADE: an efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60
Article Google Scholar
Pei J, Han J, Mortazavi-Asl B, Pinto H (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of ICDE 2001, pp 215–224
Garofalakis MN, Rastogi R, Shim K (1999) Spirit: sequential pattern mining with regular expression constraints. VLDB 99:7–10
Google Scholar
Pei J, Han J, Lakshmanan LV (2001) Mining frequent itemsets with convertible constraints. In: Proceedings of ICDE 2001. IEEE, pp 433–442
Wang J, Han J (2004) Bide: efficient mining of frequent closed sequences. In: Proceedings of ICDE 2004. IEEE, pp 79–90
Yan X, Han J, Afshar R (2003) Clospan: mining: closed sequential patterns in large datasets. In: Proceedings of SDM 2003. SIAM, pp 166–177
Bosc G, Boulicaut J-F, Raïssi C, Kaytoue M (2016) Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data Min Knowl Discov 32:1–47
MathSciNet MATH Google Scholar
Al Hasan M, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB 2(1):730–741
Article Google Scholar
Bhuiyan M, Mukhopadhyay S, Hasan MA (2012) Interactive pattern mining on hidden data: a sampling-based solution. In: Proceedings of CIKM 2012, pp 95–104
Giacometti A, Soulet A (2017) Interactive pattern sampling for characterizing unlabeled data. In: Proceedings of IDA 2017, pp 99–111
Dzyuba V, Mv L, Nijssen S, De Raedt L (2014) Interactive learning of pattern rankings. Int J Artif Intell Tools 23(06):32
Article Google Scholar
Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal 2(3–4):119–130
Article Google Scholar
Dzyuba V, van Leeuwen M (2017) Learning what matters—sampling interesting patterns. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 534–546
Anderson C (2004) The long tail. Wired Mag 12(10):170–177
Google Scholar
Diop L, Diop CT, Giacometti A, Li D, Soulet A (2018) Sequential pattern sampling with norm constraints. In: IEEE international conference on data mining (ICDM)
Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of SIGKDD 2011, pp 582–590
van Leeuwen M (2014) Interactive data exploration using pattern mining. In: Andreas Holzinger, Igor Jurisica (eds)Interactive knowledge discovery and data mining in biomedical informatics. Springer, Berlin, pp 169–182
Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73
Google Scholar
Hu Q, Imielinski T (2017) Alpine: progressive itemset mining with definite guarantees. In: Proceedings of the 2017 SIAM international conference on data mining. SIAM, pp 63–71
He Z, Xu X, Huang ZJ, Deng S (2005) Fp-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118
Article Google Scholar
Toivonen H et al (1996) Sampling large databases for association rules. Proc VLDB 96(96):134–145
Google Scholar
Luo C, Chung SM (2004) A scalable algorithm for mining maximal frequent sequences using sampling. In: Proceedings of ICTAI 2004. IEEE, pp 156–165
Raissi C, Poncelet P (2007) Sampling for sequential pattern mining: from static databases to data streams. In: Proceedings of ICDM 2007, pp 631–636
Bendimerad AA, Plantevit M, Robardet C (2016) Unsupervised exceptional attributed sub-graph mining in urban data. In: Proceedings of ICDM 2016. IEEE, pp 21–30
Giacometti A, Soulet A (2018) Dense neighborhood pattern sampling in numerical data. In: Proceedings of SDM 2018, pp 756–764
Boley M, Gärtner T, Grosskreutz H (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of SDM 2010. SIAM, pp 177–188
Li G, Zaki MJ (2012) Sampling minimal frequent boolean (DNF) patterns. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 87–95
Moens S, Goethals B (2013) Randomly sampling maximal itemsets. In: Proceedings of IDEA workshop 2013, pp 79–86
Dzyuba V, van Leeuwen M, De Raedt L (2017) Flexible constrained sampling with guarantees for pattern mining. Data Min Knowl Discov 31(5):1266–1293
Article MathSciNet Google Scholar
Gueguen M, Sentieys O, Termier A (2019) Accelerating itemset sampling using satisfiability constraints on FPGA. In: IEEE/ACM design, automation and test in Europe (DATE)
Boley M, Moens S, Gärtner T (2012) Linear space direct pattern sampling using coupling from the past. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 69–77
Moens S, Boley M (2014) Instant exceptional model mining using weighted controlled pattern sampling. In: International symposium on intelligent data analysis. Springer, Berlin, pp 203–214
Egho E, Raïssi C, Calders T, Jay N, Napoli A (2015) On measuring similarity for sequences of itemsets. Data Min Knowl Discov 29(3):732–764
Article MathSciNet Google Scholar
Egho E, Gay D, Boullé M, Voisine N, Clérot F (2017) A user parameter-free approach for mining robust sequential classification rules. Knowl Inf Syst 52(1):53–81
Article Google Scholar
Fournier-Viger P, Gomariz A, Gueniche T, Mwamikazi E, Thomas R (2013) Tks: efficient mining of top-k sequential patterns. In: International conference on advanced data mining and applications. Springer, Berlin, pp 109–120
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7: 1–30. http://dl.acm.org/citation.cfm?id=1248547.1248548
Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: International conference on extending database technology. Springer, Berlin 1–17
Arimura H, Uno T (2009) Polynomial-delay and polynomial-space algorithms for mining closed sequences, graphs, pictures in accessible set systems. In: Proceedings of the 2009 SIAM international conference on data mining. SIAM, pp 1088–1099

Download references

Acknowledgements

This work has been partly supported by the CEA-MITIC (Centre d’Excellence Africain en Mathématiques, Informatique et TIC).

Author information

Authors and Affiliations

Department of Applied Science and Technology, University Gaston Berger, Saint-Louis, Senegal
Lamine Diop & Cheikh Talibouya Diop
LIFAT, University of Tours, Blois, France
Lamine Diop, Arnaud Giacometti, Dominique Li & Arnaud Soulet

Authors

Lamine Diop
View author publications
You can also search for this author in PubMed Google Scholar
Cheikh Talibouya Diop
View author publications
You can also search for this author in PubMed Google Scholar
Arnaud Giacometti
View author publications
You can also search for this author in PubMed Google Scholar
Dominique Li
View author publications
You can also search for this author in PubMed Google Scholar
Arnaud Soulet
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lamine Diop.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Diop, L., Diop, C.T., Giacometti, A. et al. Sequential pattern sampling with norm-based utility. Knowl Inf Syst 62, 2029–2065 (2020). https://doi.org/10.1007/s10115-019-01417-3

Download citation

Received: 18 January 2019
Revised: 08 October 2019
Accepted: 12 October 2019
Published: 26 October 2019
Issue Date: May 2020
DOI: https://doi.org/10.1007/s10115-019-01417-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Sequential pattern sampling with norm-based utility

Abstract

Access this article

Similar content being viewed by others

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Stratified random sampling from streaming and stored data

Time-Dependent Graphs: Definitions, Applications, and Algorithms

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Sequential pattern sampling with norm-based utility

Abstract

Access this article

Similar content being viewed by others

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Stratified random sampling from streaming and stored data

Time-Dependent Graphs: Definitions, Applications, and Algorithms

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation