DOI: 10.1145/1401890.1401943
research-article

Constructing comprehensive summaries of large event sequences

Published: 24 August 2008

ABSTRACT

Event sequences capture system and user activity over time. Prior research on sequence mining has mostly focused on discovering local patterns. Though interesting, these patterns reveal local associations and fail to give a comprehensive summary of the entire event sequence. Moreover, the number of patterns discovered can be large. In this paper, we take an alternative approach and build short summaries that describe the entire sequence, while revealing local associations among events.

We formally define the summarization problem as an optimization problem that balances between shortness of the summary and accuracy of the data description. We show that this problem can be solved optimally in polynomial time by using a combination of two dynamic-programming algorithms. We also explore more efficient greedy alternatives and demonstrate that they work well on large datasets. Experiments on both synthetic and real datasets illustrate that our algorithms are efficient and produce high-quality results, and reveal interesting local structures in the data.
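The trade-off between a short summary and an accurate description can be illustrated with a classical segmentation dynamic program. The sketch below is not the paper's algorithm: it assumes a simplified setting in which the sequence is a list of per-timestamp event counts, the accuracy term is the squared error of each segment around its mean, and each segment adds a fixed `penalty` that stands in for the cost of a longer summary. The names `segment`, `counts`, and `penalty` are hypothetical.

```python
from itertools import accumulate


def segment(counts, penalty):
    """Return (total_cost, segments) for an optimal segmentation of counts.

    Assumed cost model (illustrative only): squared error of each segment
    around its mean, plus a fixed penalty per segment.
    """
    n = len(counts)
    # Prefix sums give any segment's squared error in constant time.
    s1 = [0.0] + list(accumulate(counts))
    s2 = [0.0] + list(accumulate(c * c for c in counts))

    def sse(i, j):
        # Squared error of counts[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[j] is the optimal cost of describing counts[:j];
    # prev[j] is the start index of the last segment in that solution.
    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Follow the back pointers to recover the segments as (start, end) pairs.
    segments = []
    j = n
    while j > 0:
        segments.append((prev[j], j))
        j = prev[j]
    segments.reverse()
    return best[n], segments


if __name__ == "__main__":
    print(segment([0, 0, 0, 5, 5, 5], penalty=1))
```

This runs in quadratic time in the sequence length. Raising `penalty` favors fewer, coarser segments, while lowering it favors a closer fit to the data.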

Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%