Discovering General Prominent Streaks in Sequence Data

Published: 01 June 2014

Abstract

This article studies the problem of prominent streak discovery in sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (or only small) values, such as consecutive games of outstanding performance in sports, consecutive hours of heavy network traffic, and consecutive days of frequent mentions of a person in social media. Prominent streak discovery provides insightful data patterns for data analysis in many real-world applications and is an enabling technique for computational journalism. Given its real-world usefulness and complexity, the research on prominent streaks in sequence data opens a spectrum of challenging problems.

A baseline approach to finding prominent streaks is a quadratic algorithm that exhaustively enumerates all possible streaks and performs pairwise streak dominance comparison. For more efficient methods, we make the observation that prominent streaks are in fact skyline points in two dimensions: streak interval length and minimum value in the interval. Our solution thus hinges on the idea of separating the two steps in prominent streak discovery: candidate streak generation and the skyline operation over candidate streaks. For candidate generation, we propose the concept of local prominent streak (LPS). We prove that prominent streaks are a subset of LPSs and that the number of LPSs is less than the length of a data sequence, in comparison with the quadratic number of candidates produced by the brute-force baseline method. We develop efficient algorithms based on the concept of LPS. The nonlinear local prominent streak (NLPS)-based method considers a superset of LPSs as candidates, and the linear local prominent streak (LLPS)-based method is further guaranteed to consider only LPSs. The proposed properties and algorithms are also extended for discovering general top-k, multisequence, and multidimensional prominent streaks. The results of experiments using multiple real datasets verified the effectiveness of the proposed methods and showed orders-of-magnitude performance improvement over the baseline method.
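The two-step idea in the abstract (generate candidates, then take the skyline over length and minimum value) can be illustrated with a small sketch. It is not the article's NLPS or LLPS algorithm. It rests on two assumptions that the abstract does not spell out: that one streak dominates another when it is at least as good in both dimensions and strictly better in one, and that a local prominent streak is an interval that cannot be extended on either side without lowering its minimum. Under those assumptions, a standard monotonic stack enumerates the candidates in one pass. All function names here are hypothetical.

```python
from typing import List, Sequence, Tuple

# A streak is (left, right, min_value) with inclusive 0-based bounds.
Streak = Tuple[int, int, float]


def dominates(a: Streak, b: Streak) -> bool:
    """Assumed rule: a is no worse than b in length and minimum value,
    and strictly better in at least one of the two."""
    len_a, len_b = a[1] - a[0] + 1, b[1] - b[0] + 1
    return (len_a >= len_b and a[2] > b[2]) or (len_a > len_b and a[2] >= b[2])


def skyline(candidates: List[Streak]) -> List[Streak]:
    """Keep the candidates that no other candidate dominates (pairwise check)."""
    return [s for s in candidates if not any(dominates(t, s) for t in candidates)]


def baseline_prominent_streaks(seq: Sequence[float]) -> List[Streak]:
    """Brute force: enumerate every interval, then filter by dominance."""
    candidates = []
    for left in range(len(seq)):
        running_min = seq[left]
        for right in range(left, len(seq)):
            running_min = min(running_min, seq[right])
            candidates.append((left, right, running_min))
    return skyline(candidates)


def maximal_streaks(seq: Sequence[float]) -> List[Streak]:
    """Candidate generation under the assumed LPS definition: intervals that
    cannot grow in either direction without lowering their minimum."""
    n = len(seq)
    out: List[Streak] = []
    stack: List[int] = []  # indices whose values are strictly increasing
    for i in range(n + 1):
        cur = seq[i] if i < n else float("-inf")  # sentinel flushes the stack
        while stack and seq[stack[-1]] >= cur:
            j = stack.pop()
            # On a tie the later equal element covers the same span, so skip.
            if seq[j] > cur:
                left = stack[-1] + 1 if stack else 0
                out.append((left, i - 1, seq[j]))
        if i < n:
            stack.append(i)
    return out


def prominent_streaks(seq: Sequence[float]) -> List[Streak]:
    """Skyline over the (at most len(seq)) candidates instead of all intervals."""
    return skyline(maximal_streaks(seq))
```

For the illustrative sequence `[3, 1, 7, 7, 2, 5, 4, 6, 7, 3]`, the stack pass yields 9 candidates rather than the 55 intervals the brute-force version enumerates, and both functions return the same five streaks. The pairwise skyline is kept for readability; since the candidates live in two dimensions, a sort-based pass could replace it.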

    • Published in

      ACM Transactions on Knowledge Discovery from Data, Volume 8, Issue 2
      June 2014
      161 pages
      ISSN: 1556-4681
      EISSN: 1556-472X
      DOI: 10.1145/2630935

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 June 2014
      • Accepted: 1 August 2013
      • Received: 1 February 2013

      Qualifiers

      • research-article
      • Research
      • Refereed
