Size matters: choosing the most informative set of window lengths for mining patterns in event sequences

Lijffijt, Jefrey; Papapetrou, Panagiotis; Puolamäki, Kai

doi:10.1007/s10618-014-0397-3

Published: 09 December 2014

Volume 29, pages 1838–1864, (2015)
Data Mining and Knowledge Discovery

Abstract

In order to find patterns in data, it is often necessary to aggregate or summarise data at a higher level of granularity. Selecting the appropriate granularity is a challenging task and often no principled solutions exist. This problem is particularly relevant in analysis of data with sequential structure. We consider this problem for a specific type of data, namely event sequences. We introduce the problem of finding the best set of window lengths for analysis of event sequences for algorithms with real-valued output. We present suitable criteria for choosing one or multiple window lengths and show that these naturally translate into a computational optimisation problem. We show that the problem is NP-hard in general, but that it can be approximated efficiently and even analytically in certain cases. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from several domains. We find that the method works well in practice, and that the optimal sets of window lengths themselves can provide new insight into the data.

Acknowledgments

We thank Heikki Mannila for useful discussions and feedback. This work was supported by the Finnish Doctoral Programme in Computational Sciences (FICS), the Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN) and the Finnish Centre of Excellence in Computational Inference Research (COIN). We acknowledge the computational resources provided by the Aalto Science-IT project.

Author information

Authors and Affiliations

Department of Engineering Mathematics, University of Bristol, MVB Woodland Road, BS8 1UB, Bristol, UK
Jefrey Lijffijt
Department of Information and Computer Science, Aalto University, Espoo, Finland
Jefrey Lijffijt
Department of Computer and Systems Sciences, Stockholm University, Forum 100, 164 40, Kista, Sweden
Panagiotis Papapetrou
Finnish Institute of Occupational Health, Topeliuksenkatu 41 a A, 00025, Helsinki, Finland
Kai Puolamäki

Corresponding author

Correspondence to Jefrey Lijffijt.

Additional information

Responsible editor: Eamonn Keogh.

Appendix: Proof of Theorem 1

Preliminaries. Let $(X_1,\ldots ,X_n)$ be a sequence of Bernoulli random variables with common parameter $p$, i.e., $X_{i} \in \left\{ {0,1}\right\}$ and ${{\mathrm{Pr}}}\left( X_{i} = 1\right) = p$ for all $i \in \left\{ {1,\ldots ,n}\right\} $. The random variables could, for example, denote the occurrences of an event. Similar to the notation for event sequences, we use $X_{i,\omega }$ to denote the subsequence of length $\omega $ starting at position $i$, $(X_i,\ldots ,X_{i+\omega -1})$. Let the statistic $f$ be the relative frequency of ones:

$$\begin{aligned} f(X_{i,\omega })=\frac{1}{\omega } \sum _{j=i}^{i+\omega -1}{X_j}. \end{aligned} \tag{5}$$
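As an illustration of the statistic $f$, the following minimal Python sketch computes the relative frequency of ones in every window of a given length using a cumulative sum. The function name `window_frequencies` and the use of NumPy are assumptions made for this example; they are not taken from the article.

```python
import numpy as np

def window_frequencies(x, omega):
    """Relative frequency of ones in every window of length omega.

    Entry i (0-based) corresponds to f(X_{i+1, omega}) in the 1-based
    notation of the text. Hypothetical helper, for illustration only.
    """
    x = np.asarray(x, dtype=float)
    if not 1 <= omega <= x.size:
        raise ValueError("omega must lie between 1 and len(x)")
    # csum[k] holds the number of ones among the first k positions.
    csum = np.concatenate(([0.0], np.cumsum(x)))
    # Window sum = difference of two cumulative sums, then normalise.
    return (csum[omega:] - csum[:-omega]) / omega

x = [1, 0, 1, 1, 0]
print(window_frequencies(x, 2))  # windows (1,0), (0,1), (1,1), (1,0)
```

The cumulative-sum formulation evaluates all windows of one length in linear time, which is convenient when many window lengths have to be compared.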

The selection of an optimal set of window lengths is based on the squared error between predictions made using those window lengths (Problem 1). Under the constraint of using a $k$-partition nearest neighbour regressor, the predictions correspond to the value of the nearest window length (Sect. 4.1). Thus, to select the optimal window lengths, we have to compute the distance (squared error) between all pairs of window lengths. We find that the distance between window lengths is as follows.

Theorem 1

For the statistic and generative process described above, the expected distance between two window lengths $\gamma $ and $\omega $, with $\gamma < \omega $, is

$$\begin{aligned} {{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right] = \frac{\omega -\gamma }{\omega \gamma } p (1-p). \end{aligned}$$

Proof

The expected distance between two window lengths $\gamma $ and $\omega $ is

$$\begin{aligned} {{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right] = {{\mathrm{E}}}\left[ {\frac{1}{n^{*}} \sum _{i=1}^{n^{*}} \left( f(X_{i,\gamma })-f(X_{i,\omega })\right) ^{2}}\right] . \end{aligned}$$

Since $X_1,\ldots ,X_n$ are i.i.d. random variables, this simplifies to

$$\begin{aligned} {{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right] = {{\mathrm{E}}}\left[ {\left( f(X_{1,\gamma })-f(X_{1,\omega })\right) ^{2}}\right] . \end{aligned}$$

Assuming without loss of generality that $\gamma < \omega $, we find that

$$\begin{aligned} f(X_{1,\omega })&= \frac{1}{\omega } \sum _{j=1}^{\omega } X_{j}\\&= \frac{1}{\omega }\sum _{j=1}^{\gamma } X_{j}+\frac{1}{\omega } \sum _{j=1+\gamma }^{\omega } X_{j}\\&= \frac{\gamma }{\omega } f(X_{1,\gamma }) + \frac{\omega - \gamma }{\omega } f(X_{1+\gamma ,\omega -\gamma }). \end{aligned}$$

Thus we can rewrite the expected distance as

$$\begin{aligned}&{{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right] \\&\quad = {{\mathrm{E}}}\left[ {\left( f(X_{1,\gamma }) - \frac{\gamma }{\omega } f(X_{1,\gamma }) - \frac{\omega -\gamma }{\omega } f(X_{1+\gamma ,\omega -\gamma })\right) ^{2}}\right] \\&\quad = {{\mathrm{E}}}\left[ {\left( \frac{\omega -\gamma }{\omega }\right) ^{2}\left( f(X_{1, \gamma }) - f(X_{1+\gamma ,\omega -\gamma })\right) ^{2}}\right] \\&\quad = \left( \frac{\omega -\gamma }{\omega }\right) ^{2} {{\mathrm{E}}}\left[ {\left( f(X_{1, \gamma }) - f(X_{1+\gamma ,\omega -\gamma })\right) ^{2}}\right] \\&\quad = \left( \frac{\omega -\gamma }{\omega }\right) ^{2} \left( {{\mathrm{E}}}\left[ {f(X_{1, \gamma })^{2}}\right] + {{\mathrm{E}}}\left[ {f(X_{1+\gamma ,\omega -\gamma })^{2}}\right] - 2 {{\mathrm{E}}}\left[ {f(X_{1, \gamma }) f(X_{1+\gamma ,\omega -\gamma })}\right] \right) . \end{aligned}$$

These three expectations are

$$\begin{aligned} {{\mathrm{E}}}\left[ {f(X_{1,\gamma })^{2}}\right]&= \frac{p(1-p)}{\gamma }+p^2,\\ {{\mathrm{E}}}\left[ {f(X_{1+\gamma ,\omega -\gamma })^{2}}\right]&= \frac{p(1-p)}{\omega -\gamma }+p^2, \text { and}\\ {{\mathrm{E}}}\left[ {f(X_{1,\gamma }) f(X_{1+\gamma ,\omega -\gamma })}\right]&= p^2. \end{aligned}$$

For brevity, we skip the derivation for these three expectations. They can be derived, for example, using the fact that the variance of a binomial distribution is ${{\mathrm{Var}}}\left[ {Bin(n,p)}\right] = {{\mathrm{E}}}\left[ {Bin(n,p)^2}\right] -{{\mathrm{E}}}\left[ {Bin(n,p)}\right] ^2 = np(1-p)$, and its expectation is ${{\mathrm{E}}}\left[ {Bin(n,p)}\right] = np$.
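One possible route to these three expectations is sketched here. The symbol $S_m$ is notation introduced only for this sketch: it denotes the number of ones in a window of length $m$, so that $S_m \sim Bin(m,p)$ and the relative frequency in that window equals $S_m/m$. Then

$$\begin{aligned} {{\mathrm{E}}}\left[ {\left( \frac{S_m}{m}\right) ^{2}}\right]&= \frac{{{\mathrm{Var}}}\left[ {S_m}\right] + {{\mathrm{E}}}\left[ {S_m}\right] ^{2}}{m^{2}}\\&= \frac{mp(1-p) + m^{2}p^{2}}{m^{2}}\\&= \frac{p(1-p)}{m}+p^{2}. \end{aligned}$$

Setting $m = \gamma $ and $m = \omega -\gamma $ gives the first two expectations. For the third, the windows $X_{1,\gamma }$ and $X_{1+\gamma ,\omega -\gamma }$ do not overlap, so the two frequencies are independent and

$$\begin{aligned} {{\mathrm{E}}}\left[ {f(X_{1,\gamma }) f(X_{1+\gamma ,\omega -\gamma })}\right] = {{\mathrm{E}}}\left[ {f(X_{1,\gamma })}\right] {{\mathrm{E}}}\left[ {f(X_{1+\gamma ,\omega -\gamma })}\right] = p \cdot p = p^{2}. \end{aligned}$$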

By writing out the expected distance we find that

$$\begin{aligned} {{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right]&= \left( \frac{\omega -\gamma }{\omega }\right) ^{2} \left( \frac{p(1-p)}{\gamma }+p^2 + \frac{p(1-p)}{\omega -\gamma }+p^2 - 2 p^2\right) \\&= \left( \frac{\omega -\gamma }{\omega }\right) ^{2} \left( \frac{1}{\gamma }+ \frac{1}{\omega -\gamma }\right) p(1-p)\\&= \frac{(\omega -\gamma )^2}{\omega ^2} \frac{\omega -\gamma +\gamma }{\gamma (\omega -\gamma )} p(1-p)\\&= \frac{\omega -\gamma }{\omega \gamma } p(1-p). \end{aligned}$$
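The closed form of Theorem 1 can be checked numerically. The sketch below is an illustration rather than part of the article: it draws i.i.d. Bernoulli sequences, estimates ${{\mathrm{E}}}\left[ (f(X_{1,\gamma })-f(X_{1,\omega }))^{2}\right] $ by Monte Carlo, and compares the estimate with the formula. The function names, the parameter values and the number of trials are all assumptions chosen for the example.

```python
import numpy as np

def expected_distance(gamma, omega, p):
    """Closed form from Theorem 1 (assumes gamma < omega)."""
    return (omega - gamma) / (omega * gamma) * p * (1 - p)

def simulated_distance(gamma, omega, p, trials=200_000, seed=0):
    """Monte Carlo estimate of E[(f(X_{1,gamma}) - f(X_{1,omega}))^2]."""
    rng = np.random.default_rng(seed)
    # Each row is one i.i.d. Bernoulli(p) sequence of length omega.
    x = rng.random((trials, omega)) < p
    f_gamma = x[:, :gamma].mean(axis=1)  # frequency in the first gamma positions
    f_omega = x.mean(axis=1)             # frequency in all omega positions
    return float(np.mean((f_gamma - f_omega) ** 2))

# Illustrative parameter values, not taken from the article.
gamma, omega, p = 5, 20, 0.3
print("closed form:", expected_distance(gamma, omega, p))
print("simulation: ", simulated_distance(gamma, omega, p))
```

For these values the closed form evaluates to $\frac{15}{100} \cdot 0.21 = 0.0315$, and the simulated value should agree with it up to sampling noise.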

Cite this article

Lijffijt, J., Papapetrou, P. & Puolamäki, K. Size matters: choosing the most informative set of window lengths for mining patterns in event sequences. Data Min Knowl Disc 29, 1838–1864 (2015). https://doi.org/10.1007/s10618-014-0397-3

Received: 03 April 2014
Accepted: 24 November 2014
Published: 09 December 2014
Issue Date: November 2015
DOI: https://doi.org/10.1007/s10618-014-0397-3
