Clustering of time-series subsequences is meaningless: implications for previous and future research

Knowledge and Information Systems

Abstract

Given the recent explosion of interest in streaming data and online algorithms, clustering of time-series subsequences, extracted via a sliding window, has received much attention. In this work, we make a surprising claim. Clustering of time-series subsequences is meaningless. More concretely, clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature. We can justify calling our claim surprising because it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method that, based on the concept of time-series motifs, is able to meaningfully cluster subsequences on some time-series datasets.
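The sliding-window setup described in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' code: the abstract names neither the clustering algorithm nor the constraint, so plain k-means, the random-walk test series, the window length of 16, and every function name below are assumptions. The sketch checks one elementary property that follows from the construction itself: the size-weighted mean of the cluster centers equals the mean of all subsequences, and that mean is nearly flat because each window position averages over almost the same points.

```python
import random

def sliding_windows(series, w):
    """Every length-w subsequence of series, stepping one point at a time."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

def column_means(rows):
    """Mean of each window position across a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=20, seed=0):
    """Plain Lloyd-style k-means (an assumed stand-in for 'any clustering algorithm')."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest center for every subsequence.
        labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        # Update step: each non-empty cluster's center becomes its members' mean.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = column_means(members)
    return centers, labels

def weighted_center_mean(centers, labels):
    """Average of the cluster centers, weighted by cluster size."""
    n = len(labels)
    total = [0.0] * len(centers[0])
    for c, center in enumerate(centers):
        weight = labels.count(c) / n
        for j, value in enumerate(center):
            total[j] += weight * value
    return total

# Hypothetical test data: a 500-point random walk with +/-1 steps.
rng = random.Random(42)
series = [0.0]
for _ in range(499):
    series.append(series[-1] + rng.choice((-1.0, 1.0)))

subsequences = sliding_windows(series, w=16)
centers, labels = kmeans(subsequences, k=3)
overall = column_means(subsequences)
weighted = weighted_center_mean(centers, labels)

# Largest gap between the weighted center mean and the overall subsequence mean.
print(max(abs(a - b) for a, b in zip(overall, weighted)))
# Spread of the overall mean across window positions vs. the range of the series.
print(max(overall) - min(overall), max(series) - min(series))
```

Under these assumptions the first printed value is zero up to floating-point error, and the first number on the second line is small relative to the second. Whether this matches the constraint the paper proves cannot be read off the abstract alone.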

Correspondence to Eamonn Keogh.

About this article

Cite this article

Keogh, E., Lin, J. Clustering of time-series subsequences is meaningless: implications for previous and future research. Knowl Inf Syst 8, 154–177 (2005). https://doi.org/10.1007/s10115-004-0172-7

  • DOI: https://doi.org/10.1007/s10115-004-0172-7
