Abstract
Given the recent explosion of interest in streaming data and online algorithms, clustering of time-series subsequences, extracted via a sliding window, has received much attention. In this work, we make a surprising claim. Clustering of time-series subsequences is meaningless. More concretely, clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising because it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method that, based on the concept of time-series motifs, is able to meaningfully cluster subsequences on some time-series datasets.
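As a rough illustration of the setting, the sketch below extracts subsequences with a sliding window and inspects one simple property of the result. The abstract does not state the constraint itself, so this is not a reproduction of the theorem. It only shows that the average of all sliding-window subsequences is nearly flat, whatever the shape of the data. The random-walk data, the window length `w = 16`, and the helper names `sliding_windows` and `column_means` are assumptions made for the example.

```python
import random


def sliding_windows(series, w):
    """Return every length-w subsequence of series, one per start index."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]


def column_means(rows):
    """Mean of each window position, taken across all rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


random.seed(0)

# Hypothetical data: a random walk of 2000 points (not a dataset from the paper).
series = [0.0]
for _ in range(1999):
    series.append(series[-1] + random.gauss(0.0, 1.0))

w = 16  # assumed window length, much shorter than the series
subsequences = sliding_windows(series, w)

# Average all subsequences position by position.
mean_subsequence = column_means(subsequences)

# Compare how much the averaged subsequence varies with how much the data varies.
spread = max(mean_subsequence) - min(mean_subsequence)
data_range = max(series) - min(series)

print(f"{len(subsequences)} subsequences of length {w}")
print(f"spread of mean subsequence: {spread:.4f}")
print(f"range of original series:   {data_range:.4f}")
```

Because adjacent windows overlap in all but one point, each position of the averaged subsequence is a mean over almost the same set of values, so the average carries almost no shape information. Any set of cluster centres produced by a centroid-based method must, when weighted by cluster size, average back to this nearly constant vector.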
Cite this article
Keogh, E., Lin, J. Clustering of time-series subsequences is meaningless: implications for previous and future research. Knowl Inf Syst 8, 154–177 (2005). https://doi.org/10.1007/s10115-004-0172-7
DOI: https://doi.org/10.1007/s10115-004-0172-7