ABSTRACT
Event sequences capture system and user activity over time. Prior research on sequence mining has mostly focused on discovering local patterns. Though interesting, these patterns reveal only local associations and fail to give a comprehensive summary of the entire event sequence. Moreover, the number of patterns discovered can be large. In this paper, we take an alternative approach and build short summaries that describe the entire sequence while still revealing local associations among events.
We formally define the summarization problem as an optimization problem that balances the shortness of the summary against the accuracy of the data description. We show that this problem can be solved optimally in polynomial time by using a combination of two dynamic-programming algorithms. We also explore more efficient greedy alternatives and demonstrate that they work well on large datasets. Experiments on both synthetic and real datasets illustrate that our algorithms are efficient, produce high-quality results, and reveal interesting local structures in the data.
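The trade-off between summary length and description accuracy can be illustrated with a small segmentation example. The sketch below is not the paper's algorithm. It covers only a segmentation-style dynamic program over a single 0/1 event stream, and it makes several assumptions that do not come from the abstract: a fixed charge of `bits_per_segment` for each segment in the summary, an entropy-based cost for describing the events inside a segment, and the hypothetical names `segment_bits` and `summarize`.

```python
import math


def segment_bits(ones, length):
    """Bits needed to describe `length` time slots holding `ones` events,
    using the segment's own event rate (an assumed, simplified cost)."""
    if ones == 0 or ones == length:
        return 0.0  # a pure segment needs no per-slot description
    p = ones / length
    return -(ones * math.log2(p) + (length - ones) * math.log2(1 - p))


def summarize(seq, bits_per_segment):
    """Find the cheapest segmentation of a 0/1 sequence.

    Total cost = sum over segments of (bits_per_segment + segment_bits).
    Returns (total_cost, segments), where segments are half-open
    (start, end) index pairs. Runs in O(n^2) time.
    """
    n = len(seq)

    # prefix[i] = number of events in seq[:i], for O(1) segment counts
    prefix = [0]
    for x in seq:
        prefix.append(prefix[-1] + x)

    best = [0.0] + [math.inf] * n  # best[i] = cheapest cost of seq[:i]
    back = [0] * (n + 1)           # back[i] = start of the last segment

    for end in range(1, n + 1):
        for start in range(end):
            ones = prefix[end] - prefix[start]
            cost = (best[start] + bits_per_segment
                    + segment_bits(ones, end - start))
            if cost < best[end]:
                best[end] = cost
                back[end] = start

    # Walk the back-pointers to recover the segment boundaries.
    segments = []
    end = n
    while end > 0:
        segments.append((back[end], end))
        end = back[end]
    segments.reverse()
    return best[n], segments
```

For instance, `summarize([0] * 8 + [1] * 8, bits_per_segment=2.0)` prefers two homogeneous segments over one mixed segment, because the extra segment charge is smaller than the cost of describing a half-empty, half-full interval. Raising `bits_per_segment` pushes the result toward shorter summaries, and lowering it pushes toward more accurate ones.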