Size matters: choosing the most informative set of window lengths for mining patterns in event sequences

Data Mining and Knowledge Discovery

Abstract

In order to find patterns in data, it is often necessary to aggregate or summarise data at a higher level of granularity. Selecting the appropriate granularity is a challenging task and often no principled solutions exist. This problem is particularly relevant in analysis of data with sequential structure. We consider this problem for a specific type of data, namely event sequences. We introduce the problem of finding the best set of window lengths for analysis of event sequences for algorithms with real-valued output. We present suitable criteria for choosing one or multiple window lengths and show that these naturally translate into a computational optimisation problem. We show that the problem is NP-hard in general, but that it can be approximated efficiently and even analytically in certain cases. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from several domains. We find that the method works well in practice, and that the optimal sets of window lengths themselves can provide new insight into the data.

Notes

  1. http://www.uta.fi/sis/tauchi/virg/projects/dammoc/tve.html.

  2. Currently http://users.ics.aalto.fi/lijffijt.

  3. http://www.ncbi.nlm.nih.gov.

Acknowledgments

We thank Heikki Mannila for useful discussions and feedback. This work was supported by the Finnish Doctoral Programme in Computational Sciences (FICS), the Finnish Centre of Excellence for Algorithmic Data Analysis Research (ALGODAN) and the Finnish Centre of Excellence in Computational Inference Research (COIN). We acknowledge the computational resources provided by the Aalto Science-IT project.

Author information

Corresponding author

Correspondence to Jefrey Lijffijt.

Additional information

Responsible editor: Eamonn Keogh.

Appendix: Proof of Theorem 1

Preliminaries. Let \((X_1,\ldots ,X_n)\) be a sequence of independent Bernoulli random variables with common parameter \(p\), i.e., \(X_{i} \in \left\{ {0,1}\right\} \) and \({{\mathrm{Pr}}}\left( X_{i} = 1\right) = p\) for all \(i \in \left\{ {1,\ldots ,n}\right\} \). The random variables could, for example, denote the occurrences of an event. Similar to the notation for event sequences, we use \(X_{i,\omega }\) to denote the subsequence of length \(\omega \) starting at position \(i\), \((X_i,\ldots ,X_{i+\omega -1})\). Let the statistic \(f\) be the relative frequency of ones:

$$\begin{aligned} f(X_{i,\omega })=\frac{1}{\omega } \sum _{j=i}^{i+\omega -1}{X_j}. \end{aligned} \tag{5}$$
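As an illustration of the statistic in Eq. (5), the following minimal sketch computes the relative frequency of ones for every window of a given length in a binary sequence. The function name `window_frequencies` and the use of NumPy are assumptions made for this example, not part of the original article, and positions are zero-based here whereas the text counts from 1.

```python
import numpy as np

def window_frequencies(x, omega):
    """Relative frequency of ones in every length-omega window of x (Eq. 5).

    Entry i of the result corresponds to the window starting at
    zero-based position i, i.e. x[i], ..., x[i + omega - 1].
    """
    x = np.asarray(x, dtype=float)
    if not 1 <= omega <= x.size:
        raise ValueError("omega must be between 1 and len(x)")
    # Prefix sums: c[i] holds the sum of the first i entries of x.
    c = np.concatenate(([0.0], np.cumsum(x)))
    # The sum over a window is a difference of two prefix sums.
    return (c[omega:] - c[:-omega]) / omega

if __name__ == "__main__":
    print(window_frequencies([1, 0, 1, 1, 0], 2))
```

For the sequence (1, 0, 1, 1, 0) and window length 2, the four windows have relative frequencies 0.5, 0.5, 1.0 and 0.5. Prefix sums keep the cost linear in the sequence length regardless of the window length.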

The selection of an optimal set of window lengths is based on the squared error between predictions made using those window lengths (Problem 1). Under the constraint of using a \(k\)-partition nearest neighbour regressor, the predictions correspond to the value of the nearest window length (Sect. 4.1). Thus, to select the optimal window lengths, we have to compute the distance (squared error) between all pairs of window lengths. We find that the distance between window lengths is as follows.

Theorem 1

For the statistic and generative process described above, the expected distance between two window lengths \(\gamma \) and \(\omega \), with \(\gamma < \omega \), is

$$\begin{aligned} {{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right] = \frac{\omega -\gamma }{\omega \gamma } p (1-p). \end{aligned}$$
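The statement can be checked numerically before reading the proof. The sketch below is a Monte Carlo check under the assumptions of the theorem (independent Bernoulli variables with parameter \(p\)); it relies on the fact that, because the variables are identically distributed, it suffices to compare the first \(\gamma \) positions with the first \(\omega \) positions of a sequence. The function names, the number of trials and the seed are choices made for this example only.

```python
import numpy as np

def theorem1_distance(omega, gamma, p):
    """Closed-form expected distance from Theorem 1 (assumes gamma < omega)."""
    return (omega - gamma) / (omega * gamma) * p * (1 - p)

def simulated_distance(omega, gamma, p, trials=200_000, seed=0):
    """Monte Carlo estimate of E[(f(X_{1,gamma}) - f(X_{1,omega}))^2]."""
    rng = np.random.default_rng(seed)
    # Each row is one independent Bernoulli(p) sequence of length omega.
    x = rng.random((trials, omega)) < p
    f_gamma = x[:, :gamma].mean(axis=1)  # relative frequency in the first gamma positions
    f_omega = x.mean(axis=1)             # relative frequency in all omega positions
    return float(np.mean((f_gamma - f_omega) ** 2))

analytic = theorem1_distance(20, 5, 0.3)
estimate = simulated_distance(20, 5, 0.3)
print(f"analytic={analytic:.4f}  simulated={estimate:.4f}")
```

With \(\omega = 20\), \(\gamma = 5\) and \(p = 0.3\) the closed form gives \(\frac{15}{100}\cdot 0.21 = 0.0315\); the simulated value should agree up to sampling noise.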

Proof

The expected distance between two window lengths \(\gamma \) and \(\omega \) is

$$\begin{aligned} {{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right] = {{\mathrm{E}}}\left[ {\frac{1}{n^{*}} \sum _{i=1}^{n^{*}} \left( f(X_{i,\gamma })-f(X_{i,\omega })\right) ^{2}}\right] . \end{aligned}$$

Since \(X_1,\ldots ,X_n\) are i.i.d. random variables, this simplifies to

$$\begin{aligned} {{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right] = {{\mathrm{E}}}\left[ {\left( f(X_{1,\gamma })-f(X_{1,\omega })\right) ^{2}}\right] . \end{aligned}$$

Assuming without loss of generality that \(\gamma < \omega \), we find that

$$\begin{aligned} f(X_{1,\omega })&= \frac{1}{\omega } \sum _{j=1}^{\omega } X_{j}\\&= \frac{1}{\omega }\sum _{j=1}^{\gamma } X_{j}+\frac{1}{\omega } \sum _{j=1+\gamma }^{\omega } X_{j}\\&= \frac{\gamma }{\omega } f(X_{1,\gamma }) + \frac{\omega - \gamma }{\omega } f(X_{1+\gamma ,\omega -\gamma }). \end{aligned}$$

Thus we can rewrite the expected distance as

$$\begin{aligned}&{{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right] \\&\quad = {{\mathrm{E}}}\left[ {\left( f(X_{1,\gamma }) - \frac{\gamma }{\omega } f(X_{1,\gamma }) - \frac{\omega -\gamma }{\omega } f(X_{1+\gamma ,\omega -\gamma })\right) ^{2}}\right] \\&\quad = {{\mathrm{E}}}\left[ {\left( \frac{\omega -\gamma }{\omega }\right) ^{2}\left( f(X_{1, \gamma }) - f(X_{1+\gamma ,\omega -\gamma })\right) ^{2}}\right] \\&\quad = \left( \frac{\omega -\gamma }{\omega }\right) ^{2} {{\mathrm{E}}}\left[ {\left( f(X_{1, \gamma }) - f(X_{1+\gamma ,\omega -\gamma })\right) ^{2}}\right] \\&\quad = \left( \frac{\omega -\gamma }{\omega }\right) ^{2} \left( {{\mathrm{E}}}\left[ {f(X_{1, \gamma })^{2}}\right] + {{\mathrm{E}}}\left[ {f(X_{1+\gamma ,\omega -\gamma })^{2}}\right] - 2 {{\mathrm{E}}}\left[ {f(X_{1, \gamma }) f(X_{1+\gamma ,\omega -\gamma })}\right] \right) . \end{aligned}$$

These three expectations are

$$\begin{aligned} {{\mathrm{E}}}\left[ {f(X_{1,\gamma })^{2}}\right]&= \frac{p(1-p)}{\gamma }+p^2,\\ {{\mathrm{E}}}\left[ {f(X_{1+\gamma ,\omega -\gamma })^{2}}\right]&= \frac{p(1-p)}{\omega -\gamma }+p^2, \text { and}\\ {{\mathrm{E}}}\left[ {f(X_{1,\gamma }) f(X_{1+\gamma ,\omega -\gamma })}\right]&= p^2. \end{aligned}$$

For brevity, we skip the derivation for these three expectations. They can be derived, for example, using the fact that the variance of a binomial distribution is \({{\mathrm{Var}}}\left[ {Bin(n,p)}\right] = {{\mathrm{E}}}\left[ {Bin(n,p)^2}\right] -{{\mathrm{E}}}\left[ {Bin(n,p)}\right] ^2 = np(1-p)\), and its expectation is \({{\mathrm{E}}}\left[ {Bin(n,p)}\right] = np\).
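One way to fill in the skipped step, following the hint above, is sketched here. The symbol \(S_{\gamma } = \sum _{j=1}^{\gamma } X_j\) is introduced only for this illustration; under the stated assumptions \(S_{\gamma }\) follows \(Bin(\gamma ,p)\) and \(f(X_{1,\gamma }) = S_{\gamma }/\gamma \), so

$$\begin{aligned} {{\mathrm{E}}}\left[ {f(X_{1,\gamma })^{2}}\right]&= \frac{{{\mathrm{E}}}\left[ {S_{\gamma }^{2}}\right] }{\gamma ^{2}} = \frac{{{\mathrm{Var}}}\left[ {S_{\gamma }}\right] + {{\mathrm{E}}}\left[ {S_{\gamma }}\right] ^{2}}{\gamma ^{2}}\\&= \frac{\gamma p(1-p) + \gamma ^{2} p^{2}}{\gamma ^{2}} = \frac{p(1-p)}{\gamma } + p^{2}. \end{aligned}$$

The same argument with \(\omega -\gamma \) in place of \(\gamma \) gives the second expectation. For the third, the windows \(X_{1,\gamma }\) and \(X_{1+\gamma ,\omega -\gamma }\) share no positions, so the two frequencies are independent and

$$\begin{aligned} {{\mathrm{E}}}\left[ {f(X_{1,\gamma }) f(X_{1+\gamma ,\omega -\gamma })}\right] = {{\mathrm{E}}}\left[ {f(X_{1,\gamma })}\right] {{\mathrm{E}}}\left[ {f(X_{1+\gamma ,\omega -\gamma })}\right] = p \cdot p = p^{2}. \end{aligned}$$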

By writing out the expected distance we find that

$$\begin{aligned} {{\mathrm{E}}}\left[ {d(\omega ,\gamma )}\right]&= \left( \frac{\omega -\gamma }{\omega }\right) ^{2} \left( \frac{p(1-p)}{\gamma }+p^2 + \frac{p(1-p)}{\omega -\gamma }+p^2 - 2 p^2\right) \\&= \left( \frac{\omega -\gamma }{\omega }\right) ^{2} \left( \frac{1}{\gamma }+ \frac{1}{\omega -\gamma }\right) p(1-p)\\&= \frac{(\omega -\gamma )^2}{\omega ^2} \frac{\omega -\gamma +\gamma }{\gamma (\omega -\gamma )} p(1-p)\\&= \frac{\omega -\gamma }{\omega \gamma } p(1-p). \end{aligned}$$
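To illustrate how the closed form could be combined with the nearest-window-length rule from the Preliminaries, the sketch below exhaustively searches for the \(k\) window lengths that minimise the summed expected distance from every candidate length to its nearest selected length. This objective is an assumption about the form of Problem 1, which is not restated in this appendix, and the brute-force search is for illustration only; it is not the method used in the article. The names `expected_distance` and `best_subset` are hypothetical.

```python
from itertools import combinations

def expected_distance(a, b, p):
    """Expected distance between window lengths a and b (Theorem 1)."""
    lo, hi = min(a, b), max(a, b)
    return (hi - lo) / (hi * lo) * p * (1 - p)

def best_subset(candidates, k, p):
    """Pick k window lengths by exhaustive search.

    Assumed objective: the sum, over all candidate lengths, of the
    expected distance to the nearest selected length.
    """
    def cost(selected):
        return sum(
            min(expected_distance(w, s, p) for s in selected)
            for w in candidates
        )
    return min(combinations(candidates, k), key=cost)

if __name__ == "__main__":
    # Candidate lengths 1..20, choose three of them.
    print(best_subset(range(1, 21), 3, p=0.5))
```

Since \(\frac{\omega -\gamma }{\omega \gamma } = \frac{1}{\gamma } - \frac{1}{\omega }\), the distance depends on the window lengths only through their reciprocals, and for \(0< p < 1\) the factor \(p(1-p)\) merely rescales the cost without changing which subset is selected. The exhaustive search is exponential in \(k\) and is only practical for small candidate sets.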

Cite this article

Lijffijt, J., Papapetrou, P. & Puolamäki, K. Size matters: choosing the most informative set of window lengths for mining patterns in event sequences. Data Min Knowl Disc 29, 1838–1864 (2015). https://doi.org/10.1007/s10618-014-0397-3
