Skip to main content
Log in

Sequential pattern sampling with norm-based utility

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Sequential pattern mining has been introduced by Agrawal and Srikant (in: Proceedings of ICDE’95, pp 3–14, 1995) 2 decades ago, and its usefulness has been widely proved for different mining tasks and application fields such as web usage mining, text mining, bioinformatics, fraud detection and so on. Since 1995, despite numerous optimization proposals, sequential pattern mining remains a costly task that often generates too many patterns. This limit, also reached by itemset mining, was circumvented by pattern sampling. Pattern sampling is a non-exhaustive method for instantly discovering relevant patterns that ensures a good interactivity while providing strong statistical guarantees due to its random nature. Curiously, such an approach investigated for different kinds of patterns including itemsets and subgraphs has not yet been applied to sequential patterns. In this paper, we propose the first method dedicated to sequential pattern sampling. In addition to address sequential data, the originality of our approach is to introduce a class of interestingness measures relying on the norm of the sequence, named norm-based utilities. In particular, it enables to add constraints on the norm of sampled patterns to control the length of the drawn patterns and to avoid the pitfall of the “long tail” where the rarest patterns flood the user. We propose a new two-step random procedure integrating this class of measures, named \({\textsc {NUSSampling}}\) that randomly draws sequential patterns according to frequency weighted by a norm-based utility. We demonstrate that this method performs an exact sampling according to the underlying measure. Moreover, despite the use of rejection sampling, the experimental study shows that \({\textsc {NUSSampling}}\) remains efficient. We especially focus on the interest of norm constraints and exponential decays that help to draw general patterns of the “head”. We also illustrate how to benefit from these sampled patterns to instantly build an associative classifier dedicated to sequences. This classification approach rivals state-of-the-art proposals showing the interest of sequential pattern sampling with norm-based utility.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

Notes

  1. By convention, we consider that \(\left( {\begin{array}{c}n\\ p\end{array}}\right) = 0\) if \(p > n\).

  2. The datasets bms and sign are available at http://www.philippe-fournier-viger.com/spmf and other ones at http://www.mybytes.de/#data.

  3. https://github.com/zakimjz/IBMGenerator.

  4. The dataset reuters is available at ana.cachopo.org/datasets-for-single-label-text-categorization and other ones, at www.mybytes.de/#data.

  5. We recall that for every sequence s, we define s[K] as: \(s[K]= \cap _{k \in K} s[k].\)

References

  1. Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of ICDE’95, pp 3–14

  2. Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86

    Article  MathSciNet  Google Scholar 

  3. Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of EDBT vol 96, pp 3–17

  4. Zaki MJ (2001) SPADE: an efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60

    Article  Google Scholar 

  5. Pei J, Han J, Mortazavi-Asl B, Pinto H (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of ICDE 2001, pp 215–224

  6. Garofalakis MN, Rastogi R, Shim K (1999) Spirit: sequential pattern mining with regular expression constraints. VLDB 99:7–10

    Google Scholar 

  7. Pei J, Han J, Lakshmanan LV (2001) Mining frequent itemsets with convertible constraints. In: Proceedings of ICDE 2001. IEEE, pp 433–442

  8. Wang J, Han J (2004) Bide: efficient mining of frequent closed sequences. In: Proceedings of ICDE 2004. IEEE, pp 79–90

  9. Yan X, Han J, Afshar R (2003) Clospan: mining: closed sequential patterns in large datasets. In: Proceedings of SDM 2003. SIAM, pp 166–177

  10. Bosc G, Boulicaut J-F, Raïssi C, Kaytoue M (2016) Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data Min Knowl Discov 32:1–47

    MathSciNet  MATH  Google Scholar 

  11. Al Hasan M, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB 2(1):730–741

    Article  Google Scholar 

  12. Bhuiyan M, Mukhopadhyay S, Hasan MA (2012) Interactive pattern mining on hidden data: a sampling-based solution. In: Proceedings of CIKM 2012, pp 95–104

  13. Giacometti A, Soulet A (2017) Interactive pattern sampling for characterizing unlabeled data. In: Proceedings of IDA 2017, pp 99–111

  14. Dzyuba V, Mv L, Nijssen S, De Raedt L (2014) Interactive learning of pattern rankings. Int J Artif Intell Tools 23(06):32

    Article  Google Scholar 

  15. Giacometti A, Soulet A (2016) Anytime algorithm for frequent pattern outlier detection. Int J Data Sci Anal 2(3–4):119–130

    Article  Google Scholar 

  16. Dzyuba V, van Leeuwen M (2017) Learning what matters—sampling interesting patterns. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 534–546

  17. Anderson C (2004) The long tail. Wired Mag 12(10):170–177

    Google Scholar 

  18. Diop L, Diop CT, Giacometti A, Li D, Soulet A (2018) Sequential pattern sampling with norm constraints. In: IEEE international conference on data mining (ICDM)

  19. Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of SIGKDD 2011, pp 582–590

  20. van Leeuwen M (2014) Interactive data exploration using pattern mining. In: Andreas Holzinger, Igor Jurisica (eds)Interactive knowledge discovery and data mining in biomedical informatics. Springer, Berlin, pp 169–182

  21. Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73

    Google Scholar 

  22. Hu Q, Imielinski T (2017) Alpine: progressive itemset mining with definite guarantees. In: Proceedings of the 2017 SIAM international conference on data mining. SIAM, pp 63–71

  23. He Z, Xu X, Huang ZJ, Deng S (2005) Fp-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118

    Article  Google Scholar 

  24. Toivonen H et al (1996) Sampling large databases for association rules. Proc VLDB 96(96):134–145

    Google Scholar 

  25. Luo C, Chung SM (2004) A scalable algorithm for mining maximal frequent sequences using sampling. In: Proceedings of ICTAI 2004. IEEE, pp 156–165

  26. Raissi C, Poncelet P (2007) Sampling for sequential pattern mining: from static databases to data streams. In: Proceedings of ICDM 2007, pp 631–636

  27. Bendimerad AA, Plantevit M, Robardet C (2016) Unsupervised exceptional attributed sub-graph mining in urban data. In: Proceedings of ICDM 2016. IEEE, pp 21–30

  28. Giacometti A, Soulet A (2018) Dense neighborhood pattern sampling in numerical data. In: Proceedings of SDM 2018, pp 756–764

  29. Boley M, Gärtner T, Grosskreutz H (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: Proceedings of SDM 2010. SIAM, pp 177–188

  30. Li G, Zaki MJ (2012) Sampling minimal frequent boolean (DNF) patterns. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 87–95

  31. Moens S, Goethals B (2013) Randomly sampling maximal itemsets. In: Proceedings of IDEA workshop 2013, pp 79–86

  32. Dzyuba V, van Leeuwen M, De Raedt L (2017) Flexible constrained sampling with guarantees for pattern mining. Data Min Knowl Discov 31(5):1266–1293

    Article  MathSciNet  Google Scholar 

  33. Gueguen M, Sentieys O, Termier A (2019) Accelerating itemset sampling using satisfiability constraints on FPGA. In: IEEE/ACM design, automation and test in Europe (DATE)

  34. Boley M, Moens S, Gärtner T (2012) Linear space direct pattern sampling using coupling from the past. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 69–77

  35. Moens S, Boley M (2014) Instant exceptional model mining using weighted controlled pattern sampling. In: International symposium on intelligent data analysis. Springer, Berlin, pp 203–214

  36. Egho E, Raïssi C, Calders T, Jay N, Napoli A (2015) On measuring similarity for sequences of itemsets. Data Min Knowl Discov 29(3):732–764

    Article  MathSciNet  Google Scholar 

  37. Egho E, Gay D, Boullé M, Voisine N, Clérot F (2017) A user parameter-free approach for mining robust sequential classification rules. Knowl Inf Syst 52(1):53–81

    Article  Google Scholar 

  38. Fournier-Viger P, Gomariz A, Gueniche T, Mwamikazi E, Thomas R (2013) Tks: efficient mining of top-k sequential patterns. In: International conference on advanced data mining and applications. Springer, Berlin, pp 109–120

  39. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7: 1–30. http://dl.acm.org/citation.cfm?id=1248547.1248548

  40. Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: International conference on extending database technology. Springer, Berlin 1–17

  41. Arimura H, Uno T (2009) Polynomial-delay and polynomial-space algorithms for mining closed sequences, graphs, pictures in accessible set systems. In: Proceedings of the 2009 SIAM international conference on data mining. SIAM, pp 1088–1099

Download references

Acknowledgements

This work has been partly supported by the CEA-MITIC (Centre d’Excellence Africain en Mathématiques, Informatique et TIC).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Lamine Diop.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Diop, L., Diop, C.T., Giacometti, A. et al. Sequential pattern sampling with norm-based utility. Knowl Inf Syst 62, 2029–2065 (2020). https://doi.org/10.1007/s10115-019-01417-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-019-01417-3

Keywords

Navigation