Abstract
This article studies the problem of prominent streak discovery in sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (or only small) values, such as consecutive games of outstanding performance in sports, consecutive hours of heavy network traffic, or consecutive days of frequent mentions of a person in social media. Prominent streak discovery reveals insightful patterns for data analysis in many real-world applications and is an enabling technique for computational journalism. Given its real-world usefulness and complexity, research on prominent streaks in sequence data opens a spectrum of challenging problems.
A baseline approach to finding prominent streaks is a quadratic algorithm that exhaustively enumerates all possible streaks and performs pairwise streak dominance comparison. For more efficient methods, we observe that prominent streaks are in fact skyline points in two dimensions: streak interval length and minimum value in the interval. Our solution thus hinges on separating the two steps in prominent streak discovery: candidate streak generation and the skyline operation over candidate streaks. For candidate generation, we propose the concept of local prominent streak (LPS). We prove that prominent streaks are a subset of LPSs and that the number of LPSs is less than the length of a data sequence, in contrast to the quadratic number of candidates produced by the brute-force baseline method. We develop efficient algorithms based on the concept of LPS. The nonlinear local prominent streak (NLPS)-based method considers a superset of LPSs as candidates, and the linear local prominent streak (LLPS)-based method is further guaranteed to consider only LPSs. The proposed properties and algorithms are also extended to discover general top-k, multisequence, and multidimensional prominent streaks. Experiments on multiple real datasets verified the effectiveness of the proposed methods and showed orders-of-magnitude performance improvement over the baseline method.
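The two-step idea described above can be illustrated with a minimal Python sketch. It is not the NLPS or LLPS algorithm from the article. The function names, the `(left, right, min_value)` tuple layout, and the sample sequence are assumptions made for illustration. Candidate generation here uses a standard monotonic-stack scan that, under the assumption that an LPS is a streak that cannot be extended on either side without lowering its minimum, yields one maximal streak per position. The skyline step keeps the streaks that no other streak matches or beats in both length and minimum value.

```python
from itertools import groupby


def all_streaks(seq):
    """Baseline candidates: every interval [l, r] with its minimum value."""
    out = []
    for l in range(len(seq)):
        v = seq[l]
        for r in range(l, len(seq)):
            v = min(v, seq[r])  # running minimum of seq[l..r]
            out.append((l, r, v))
    return out


def maximal_streaks(seq):
    """Hypothetical LPS-style candidates: for each position, the widest
    interval in which that position's value is still the minimum."""
    n = len(seq)
    left, right = [-1] * n, [n] * n
    stack = []
    for i in range(n):  # nearest strictly smaller value to the left
        while stack and seq[stack[-1]] >= seq[i]:
            stack.pop()
        left[i] = stack[-1] if stack else -1
        stack.append(i)
    stack = []
    for i in range(n - 1, -1, -1):  # nearest strictly smaller to the right
        while stack and seq[stack[-1]] >= seq[i]:
            stack.pop()
        right[i] = stack[-1] if stack else n
        stack.append(i)
    # The set removes duplicates produced by equal neighbouring values.
    return sorted({(left[i] + 1, right[i] - 1, seq[i]) for i in range(n)})


def skyline(streaks):
    """Keep streaks not dominated in (interval length, minimum value)."""
    def length(s):
        return s[1] - s[0] + 1

    ordered = sorted(streaks, key=lambda s: (-length(s), -s[2]))
    result, best_longer = [], float("-inf")
    for _, group in groupby(ordered, key=length):
        group = list(group)
        top = group[0][2]  # largest minimum among streaks of this length
        if top > best_longer:  # no strictly longer streak is at least as high
            result.extend(s for s in group if s[2] == top)
        best_longer = max(best_longer, top)
    return sorted(result)


if __name__ == "__main__":
    sample = [3, 1, 7, 7, 2, 5, 4, 6, 7, 3]  # illustrative data only
    print(skyline(maximal_streaks(sample)))
    print(skyline(all_streaks(sample)) == skyline(maximal_streaks(sample)))
```

On this sample the stack-based generator produces at most one candidate per position, whereas the baseline produces n(n+1)/2 candidates before the skyline step. Both routes return the same skyline on this input.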
Discovering General Prominent Streaks in Sequence Data