Compression-based data mining of sequential data

Data Mining and Knowledge Discovery

Abstract

The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail to find the true patterns. Second, and perhaps more insidiously, the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the patterns it reports. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should therefore have as few parameters as possible. A parameter-light algorithm limits our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and lets the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data mining paradigm. The results are strongly connected to Kolmogorov complexity theory. As a practical matter, however, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We show that this approach is competitive with or superior to many state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series, DNA, text, XML, and video datasets. As further evidence of the advantages of our method, we demonstrate its effectiveness in solving a real-world classification problem: recommending printing services and products.
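
To make the "dozen lines of code" claim concrete, here is a minimal sketch of a compression-based dissimilarity built on an off-the-shelf compressor. The abstract does not state the formula, so the ratio used here, C(xy) / (C(x) + C(y)) with C the compressed size in bytes, is an assumption, as are the choice of zlib, the function names, and the toy data. It should be read as an illustration of the idea rather than as the authors' implementation.

```python
import random
import zlib
from typing import List, Tuple


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """Assumed dissimilarity: C(xy) / (C(x) + C(y)).

    Smaller values mean the compressor could reuse structure from x when
    encoding y; values close to 1 mean it found little in common.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def nearest_label(query: bytes, labelled: List[Tuple[bytes, str]]) -> str:
    """Hypothetical helper: 1-nearest-neighbour classification under cdm."""
    return min(labelled, key=lambda item: cdm(query, item[0]))[1]


# Toy data (illustrative only): two near-identical periodic sequences
# and one sequence of seeded pseudo-random bytes.
rng = random.Random(0)
periodic_a = b"0123456789" * 60
periodic_b = b"0123456789" * 55 + b"01234x6789" * 5
noise = bytes(rng.randrange(256) for _ in range(600))

similar = cdm(periodic_a, periodic_b)
dissimilar = cdm(periodic_a, noise)
label = nearest_label(b"0123456789" * 40,
                      [(periodic_a, "periodic"), (noise, "noise")])
```

The only choice the user makes in this sketch is the compressor itself, which is the sense in which such a measure is parameter-light. Sequential data that is not already symbolic (real-valued time series, for example) would first need to be turned into a byte string, a step this sketch leaves out.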

Author information

Corresponding author

Correspondence to Eamonn Keogh.

Additional information

Responsible editor: Johannes Gehrke

About this article

Cite this article

Keogh, E., Lonardi, S., Ratanamahatana, C.A. et al. Compression-based data mining of sequential data. Data Min Knowl Disc 14, 99–129 (2007). https://doi.org/10.1007/s10618-006-0049-3

  • DOI: https://doi.org/10.1007/s10618-006-0049-3
