Abstract
This paper proposes the sky-signature model, an extension of the signature model (Gautrais et al., in: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Springer, 2017b) to multi-objective optimization. The signature approach considers a sequence of itemsets and, given a number k, returns a segmentation of the sequence into k segments such that the number of items occurring in all segments is maximized. The limitation of this approach is that k must be set manually, which fixes the temporal granularity at which the data is analyzed. The sky-signature model proposed in this paper removes this requirement and allows the results to be examined at multiple levels of granularity, while keeping the output compact. This paper also proposes efficient algorithms to mine sky-signatures, as well as an experimental validation on real data from both the retail domain and natural language processing (political speeches).
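To make the two notions in the abstract concrete, the following is a minimal brute-force sketch, not the efficient algorithms the paper proposes. The function names `signature` and `sky_signature` are hypothetical, and the dominance rule (keep a pair of segment count and shared item set only if no other pair is at least as good on both criteria and strictly better on one) is an assumed reading of "multi-objective optimization" that the abstract itself does not spell out.

```python
from itertools import combinations


def signature(sequence, k):
    """Brute force: split `sequence` (a list of sets) into k contiguous
    segments so that the number of items present in every segment is
    maximal. Returns (shared_items, cut_positions). Exponential in k;
    for illustration only."""
    n = len(sequence)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(sequence)")
    best_items, best_cuts = frozenset(), None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        common = None
        for start, end in zip(bounds, bounds[1:]):
            # Items seen anywhere inside this segment.
            seg_items = frozenset().union(*sequence[start:end])
            common = seg_items if common is None else common & seg_items
        if best_cuts is None or len(common) > len(best_items):
            best_items, best_cuts = common, cuts
    return best_items, best_cuts


def sky_signature(sequence):
    """Assumed skyline: keep (k, items) pairs that are not dominated
    when both k and the number of shared items are to be maximized."""
    candidates = [
        (k, signature(sequence, k)[0]) for k in range(1, len(sequence) + 1)
    ]

    def dominated(k, items):
        return any(
            k2 >= k
            and len(i2) >= len(items)
            and (k2 > k or len(i2) > len(items))
            for k2, i2 in candidates
        )

    return [(k, items) for k, items in candidates if not dominated(k, items)]


# Toy sequence of four itemsets (hypothetical data).
toy = [{"a", "b"}, {"a"}, {"b", "a"}, {"a", "c"}]
result = sky_signature(toy)
```

On this toy sequence the best item counts for k = 1, 2, 3, 4 are 3, 2, 1, 1, so the pair for k = 3 is dropped: k = 4 reaches the same single item with more segments. The remaining pairs show the same data at three granularities without requiring k to be chosen up front.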
Notes
Including the 3 presidential debates.
Speeches of Clinton prior to April 2015 were discarded.
Climate change became a topic of interest when, in the first presidential debate (September 26, 2016), Clinton attacked Trump for saying that climate change is a hoax.
References
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), pp 487–499
Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering (ICDE), pp 3–14
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 17th international conference on management of data, pp 207–216
Bellman R (1961) On the approximation of curves by line segments using dynamic programming. Commun ACM 4(6):284
Bellman R (2013) Dynamic programming. Dover Publications, Inc., New York
Bingham E (2010) Finding segmentations of sequences. In: Džeroski S, Goethals B, Panov P (eds) Inductive databases and constraint-based data mining. Springer, New York, pp 177–197
Borzsony S, Kossmann D, Stocker K (2001) The skyline operator. In: Proceedings 17th international conference on data engineering, pp 421–430. https://doi.org/10.1109/ICDE.2001.914855
Bosc G, Boulicaut JF, Raïssi C, Kaytoue M (2018) Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data Min Knowl Discov 32(3):604–650
Casas-Garriga G (2003) Discovering unbounded episodes in sequential data. In: Proceedings of European conference on principles of data mining and knowledge discovery (ECML/PKDD), pp 83–94
Chundi P, Rosenkrantz DJ (2008) Efficient algorithms for segmentation of item-set time series. Data Min Knowl Discov 17(3):377–401
Cueva PL, Bertaux A, Termier A, Méhaut J, Santana M (2012) Debugging embedded multimedia application traces through periodic pattern mining. In: Proceedings of the 12th international conference on embedded software, pp 13–22
Cule B, Goethals B, Robardet C (2009) A new constraint for mining sets in sequences. In: Proceedings of the SIAM international conference on data mining SDM’09, SIAM, pp 317–328
De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 7th international conference on data mining (ICDM), pp 237–248
Gautrais C, Cellier P, Quiniou R, Termier A (2017a) Topic signatures in political campaign speeches. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 2342–2347
Gautrais C, Quiniou R, Cellier P, Guyet T, Termier A (2017b) Purchase signatures of retail customers. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining (PAKDD). Springer, pp 110–121
Gautrais C, Cellier P, van Leeuwen M, Termier A (2020) Widening for MDL-based retail signature discovery. In: Berthold MR, Feelders A, Krempl G (eds) Advances in intelligent data analysis XVIII—18th international symposium on intelligent data analysis, IDA 2020, Konstanz, Germany, April 27–29, 2020, proceedings. Lecture notes in computer science, vol 12080. Springer, pp 197–209
Guns T, Nijssen S, De Raedt L (2013) k-pattern set mining under constraints. Trans Knowl Data Eng (TKDE) 25(2):402–418
Haiminen N, Gionis A (2004) Unimodal segmentation of sequences. In: Proceedings of the 4th international conference on data mining (ICDM), pp 106–113
Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. SIGMOD Rec 29(2):1–12
Han J, Wang J, Lu Y, Tzvetkov P (2002) Mining top-k frequent closed patterns without minimum support. In: Proceedings of the international conference on data mining (ICDM), pp 211–218
Kiernan J, Terzi E (2009) Constructing comprehensive summaries of large event sequences. ACM Trans Knowl Discov Data. https://doi.org/10.1145/1631162.1631169
Kung HT, Luccio F, Preparata FP (1975) On finding the maxima of a set of vectors. J ACM 22(4):469–476. https://doi.org/10.1145/321906.321910
Ma S, Hellerstein JL (2001) Mining partially periodic event patterns with unknown periods. In: Proceedings of the 17th international conference on data engineering (ICDE), pp 205–214
Mannila H, Toivonen H, Verkamo AI (1997) Discovery of frequent episodes in event sequences. Data Min Knowl Discov 1(3):259–289
Miguéis VL, Camanho AS, Falcão e Cunha J (2011) Mining customer loyalty card programs: the improvement of service levels enabled by innovative segmentation and promotions design. In: Proceedings of the international conference on exploring services science (IESS), pp 83–97
Miguéis VL, Camanho AS, Falcão e Cunha J (2012) Customer data mining for lifestyle segmentation. Expert Syst Appl 39(10):9359–9366
Naturel X, Gros P (2008) Detecting repeats for video structuring. Multimed Tools Appl 38(2):233–252
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the 7th international conference on database theory (ICDT), pp 398–416
Pei J, Han J, Mortazavi-Asl B, Pinto H (2001) Prefixspan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the international conference on data engineering (ICDE), pp 215–224
Soulet A, Raïssi C, Plantevit M, Cremilleux B (2011) Mining dominant patterns in the sky. In: Proceedings of the 11th international conference on data mining (ICDM), pp 655–664
Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66
Terzi E, Tsaparas P (2006) Efficient algorithms for sequence segmentation. In: Proceedings of the SIAM conference on data mining (SDM), pp 314–325
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B (Stat Methodol) 63(2):411–423
Tseng VS, Shie BE, Wu CW, Yu PS (2013) Efficient algorithms for mining high utility itemsets from transactional databases. Trans Knowl Data Eng (TKDE) 25(8):1772–1786
van Leeuwen M, Knobbe A (2012) Diverse subgroup set discovery. Data Min Knowl Discov 25(2):208–242
Acknowledgements
We would like to thank the reviewers for taking the time and effort necessary to review the manuscript. We sincerely appreciate all valuable comments and suggestions, which helped us to improve the quality of the manuscript.
Funding
No funding was received to assist with the preparation of this manuscript. The initial work that led to the first signature paper benefited from the funding of a 6-month internship in 2015 from the "Groupement des Mousquetaires" French retail group.
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Responsible editor: M.J. Zaki.
About this article
Cite this article
Gautrais, C., Cellier, P., Guyet, T. et al. Sky-signatures: detecting and characterizing recurrent behavior in sequential data. Data Min Knowl Disc 38, 372–419 (2024). https://doi.org/10.1007/s10618-023-00949-1
DOI: https://doi.org/10.1007/s10618-023-00949-1