Sky-signatures: detecting and characterizing recurrent behavior in sequential data

Data Mining and Knowledge Discovery

Abstract

This paper proposes the sky-signature model, an extension of the signature model (Gautrais et al. 2017b) to multi-objective optimization. The signature approach considers a sequence of itemsets and, given a number k, returns a segmentation of the sequence into k segments such that the number of items occurring in all segments is maximized. The limitation of this approach is that k must be set manually, which fixes the temporal granularity at which the data is analyzed. The sky-signature model proposed in this paper removes this requirement and allows the results to be examined at multiple levels of granularity, while keeping the output compact. This paper also proposes efficient algorithms to mine sky-signatures, as well as an experimental validation on real data from both the retail domain and natural language processing (political speeches).
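
As an illustration of the two definitions in the abstract, the following is a minimal brute-force sketch in Python, not the efficient algorithms the paper proposes. It assumes that segments are contiguous and non-empty, that an item "occurs in a segment" when it appears in at least one itemset of that segment, and that the sky-signature keeps the (k, items) pairs that are not dominated when both k and the number of items are maximized. The function names and the toy sequence are hypothetical.

```python
from itertools import combinations


def segment_items(seq, start, end):
    """Return the union of the itemsets seq[start:end]."""
    items = set()
    for itemset in seq[start:end]:
        items |= itemset
    return items


def best_signature(seq, k):
    """Try every split of seq into k contiguous, non-empty segments and
    return (items, cuts) with the most items common to all segments."""
    n = len(seq)
    best_items, best_cuts = set(), None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        common = set.intersection(
            *(segment_items(seq, a, b) for a, b in zip(bounds, bounds[1:]))
        )
        if best_cuts is None or len(common) > len(best_items):
            best_items, best_cuts = common, cuts
    return best_items, best_cuts


def sky_signatures(seq):
    """Keep the (k, items) pairs not dominated on (k, number of items).
    Assumption: both objectives are maximized."""
    candidates = [(k, best_signature(seq, k)[0]) for k in range(1, len(seq) + 1)]
    skyline = []
    for k, items in candidates:
        dominated = any(
            k2 >= k and len(i2) >= len(items) and (k2 > k or len(i2) > len(items))
            for k2, i2 in candidates
        )
        if not dominated:
            skyline.append((k, items))
    return skyline


# Hypothetical toy sequence of four itemsets.
toy = [{"a", "b"}, {"a"}, {"b"}, {"a", "b", "c"}]
for k, items in sky_signatures(toy):
    print(k, sorted(items))
```

The enumeration is exponential in k, so it is only suitable for very short sequences; its purpose is to make concrete why no single value of k has to be chosen in advance.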

Notes

  1. http://people.irisa.fr/Thomas.Guyet/demo_signatures/signatures.php.

  2. http://www.presidency.ucsb.edu/2016_election.php.

  3. Including the 3 presidential debates. Speeches of Clinton prior to April 2015 were discarded.

  4. Climate change became a topic of interest when Clinton attacked Trump, in the first presidential debate (September 26, 2016), for saying that climate change is a hoax.

References

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), pp 487–499

  • Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering (ICDE), pp 3–14

  • Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 17th international conference on management of data, pp 207–216

  • Bellman R (1961) On the approximation of curves by line segments using dynamic programming. Commun ACM 4(6):284

  • Bellman R (2013) Dynamic programming. Dover Publications, Inc., New York

  • Bingham E (2010) Finding segmentations of sequences. In: Džeroski S, Goethals B, Panov P (eds) Inductive databases and constraint-based data mining. Springer, New York, pp 177–197

  • Borzsony S, Kossmann D, Stocker K (2001) The skyline operator. In: Proceedings 17th international conference on data engineering, pp 421–430. https://doi.org/10.1109/ICDE.2001.914855

  • Bosc G, Boulicaut JF, Raïssi C, Kaytoue M (2018) Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data Min Knowl Discov 32(3):604–650

  • Casas-Garriga G (2003) Discovering unbounded episodes in sequential data. In: Proceedings of European conference on principles of data mining and knowledge discovery (ECML/PKDD), pp 83–94

  • Chundi P, Rosenkrantz DJ (2008) Efficient algorithms for segmentation of item-set time series. Data Min Knowl Discov 17(3):377–401

  • Cueva PL, Bertaux A, Termier A, Méhaut J, Santana M (2012) Debugging embedded multimedia application traces through periodic pattern mining. In: Proceedings of the 12th international conference on embedded software, pp 13–22

  • Cule B, Goethals B, Robardet C (2009) A new constraint for mining sets in sequences. In: Proceedings of the SIAM international conference on data mining SDM’09, SIAM, pp 317–328

  • De Raedt L, Zimmermann A (2007) Constraint-based pattern set mining. In: Proceedings of the 7th international conference on data mining (ICDM), pp 237–248

  • Gautrais C, Cellier P, Quiniou R, Termier A (2017a) Topic signatures in political campaign speeches. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 2342–2347

  • Gautrais C, Quiniou R, Cellier P, Guyet T, Termier A (2017b) Purchase signatures of retail customers. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining (PAKDD). Springer, pp 110–121

  • Gautrais C, Cellier P, van Leeuwen M, Termier A (2020) Widening for MDL-based retail signature discovery. In: Berthold MR, Feelders A, Krempl G (eds) Advances in intelligent data analysis XVIII—18th international symposium on intelligent data analysis, IDA 2020, Konstanz, Germany, April 27–29, 2020, proceedings. Lecture notes in computer science, vol 12080. Springer, pp 197–209

  • Guns T, Nijssen S, De Raedt L (2013) k-pattern set mining under constraints. Trans Knowl Data Eng (TKDE) 25(2):402–418

  • Haiminen N, Gionis A (2004) Unimodal segmentation of sequences. In: Proceedings of the 4th international conference on data mining (ICDM), pp 106–113

  • Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. SIGMOD Rec 29(2):1–12

  • Han J, Wang J, Lu Y, Tzvetkov P (2002) Mining top-k frequent closed patterns without minimum support. In: Proceedings of the international conference on data mining (ICDM), pp 211–218

  • Kiernan J, Terzi E (2009) Constructing comprehensive summaries of large event sequences. ACM Trans Knowl Discov Data. https://doi.org/10.1145/1631162.1631169

  • Kung HT, Luccio F, Preparata FP (1975) On finding the maxima of a set of vectors. J ACM 22(4):469–476. https://doi.org/10.1145/321906.321910

  • Ma S, Hellerstein JL (2001) Mining partially periodic event patterns with unknown periods. In: Proceedings of the 17th international conference on data engineering (ICDE), pp 205–214

  • Mannila H, Toivonen H, Verkamo AI (1997) Discovery of frequent episodes in event sequences. Data Min Knowl Discov 1(3):259–289

  • Miguéis VL, Camanho AS, Falcão e Cunha J (2011) Mining customer loyalty card programs: the improvement of service levels enabled by innovative segmentation and promotions design. In: Proceedings of the international conference on exploring services science (IESS), pp 83–97

  • Miguéis VL, Camanho AS, Falcão e Cunha J (2012) Customer data mining for lifestyle segmentation. Expert Syst Appl 39(10):9359–9366

  • Naturel X, Gros P (2008) Detecting repeats for video structuring. Multimed Tools Appl 38(2):233–252

  • Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the 7th international conference on database theory (ICDT), pp 398–416

  • Pei J, Han J, Mortazavi-Asl B, Pinto H (2001) Prefixspan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the international conference on data engineering (ICDE), pp 215–224

  • Soulet A, Raïssi C, Plantevit M, Cremilleux B (2011) Mining dominant patterns in the sky. In: Proceedings of the 11th international conference on data mining (ICDM), pp 655–664

  • Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66

  • Terzi E, Tsaparas P (2006) Efficient algorithms for sequence segmentation. In: Proceedings of the SIAM conference on data mining (SDM), pp 314–325

  • Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B (Stat Methodol) 63(2):411–423

  • Tseng VS, Shie BE, Wu CW, Yu PS (2013) Efficient algorithms for mining high utility itemsets from transactional databases. Trans Knowl Data Eng (TKDE) 25(8):1772–1786

  • van Leeuwen M, Knobbe A (2012) Diverse subgroup set discovery. Data Min Knowl Discov 25(2):208–242

Acknowledgements

We would like to thank the reviewers for taking the time and effort necessary to review the manuscript. We sincerely appreciate all valuable comments and suggestions, which helped us to improve the quality of the manuscript.

Funding

No funding was received to assist with the preparation of this manuscript. The initial work that led to the first signature paper benefited from the funding of a 6-month internship in 2015 from the "Groupement des Mousquetaires" French retail group.

Author information

Corresponding author

Correspondence to Clément Gautrais.

Ethics declarations

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Responsible editor: M.J. Zaki.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gautrais, C., Cellier, P., Guyet, T. et al. Sky-signatures: detecting and characterizing recurrent behavior in sequential data. Data Min Knowl Disc 38, 372–419 (2024). https://doi.org/10.1007/s10618-023-00949-1

  • DOI: https://doi.org/10.1007/s10618-023-00949-1
