Compression-based data mining of sequential data

Keogh, Eamonn; Lonardi, Stefano; Ratanamahatana, Chotirat Ann; Wei, Li; Lee, Sang-Hee; Handley, John

doi:10.1007/s10618-006-0049-3

Compression-based data mining of sequential data

Published: 26 January 2007

Volume 14, pages 99–129, (2007)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Eamonn Keogh¹,
Stefano Lonardi¹,
Chotirat Ann Ratanamahatana²,
Li Wei¹,
Sang-Hee Lee³ &
…
John Handley⁴

1085 Accesses
65 Citations
7 Altmetric
1 Mention
Explore all metrics

Abstract

The vast majority of data mining algorithms require the setting of many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible. A parameter-light algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data-mining paradigm. The results are strongly connected to Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We will show that this approach is competitive or superior to many of the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/XML/video datasets. As a further evidence of the advantages of our method, we will demonstrate its effectiveness to solve a real world classification problem in recommending printing services and products.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

References

Allison L, Stern L, Edgoose T, Dix TI (2000) Sequence complexity for biological sequence analysis. Comput Chem 24(1):43–55
Google Scholar
Baronchelli A, Caglioti E, Loreto V (2005) Artificial sequences and complexity measures. J. Stat. Mech: Theory and Exp, Issue 04, P04002
Benedetto D, Caglioti E, Loreto V (2002) Language trees and zipping. Phys Rev Lett 88: 048702
Article Google Scholar
Chakrabarti D, Papadimitriou S, Modha D, Faloutsos C (2004) Fully automatic cross-assocations, In: Proceedings of the KDD 2004, Seattle, WA
Christen P, Goiser K (2005) Towards automated data linkage and deduplication. Tech Report, Australian National University
Google Scholar
Cook D, Holder LB (2000) Graph-based data mining. IEEE Intell Syst 15(2):32–41
Article Google Scholar
Dasgupta D, Forrest S (1999) Novelty detection in time series data using ideas from immunology. In: Proc. of the international conference on intelligent systems, Heidelberg, Germany
Domingos P (1998) A process-oriented heuristic for model selection. In: Machine learning Proc. of the fifteenth international conference,. Morgan Kaufmann Publishers, San Francisco, CA, pp 27–135
Elkan, C (2001) Magical thinking in data mining: lessons from CoIL challenge 2000. In Proc. of SIGKDD 2001, San Francisco, CA, USA, pp 426–431
Elkan C (2003) Using the triangle inequality to accelerate k-Means. In: Proc. of ICML 2003, Washington DC, USA, pp 147–153
Faloutsos C, Lin K (1995) FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proc. of 24th ACM SIGMOD, San Jose, CA, USA
Farach M, Noordewier M, Savari S, Shepp L, Wyner A, Ziv J (1995) On the entropy of DNA: algorithms and measurements based on memory and rapid convergence. In: Proc. of the symp. on discrete algorithms, San Francisco, CA, USA pp 48-57
Ferrandina F, Meyer T, Zicari R (1994) Implementing lazy database updates for an object database system. In: Proc. of the 20 international conference on very large databases, Santiago de Chile, Chile, pp 261–272
Flexer A (1996) Statistical evaluation of neural networks experiments: minimum requirements and current practice. In: Proc. of the 13th european meeting on cybernetics and systems research, vol. 2, Austria, pp 1005–1008
Frank E, Chui C, Witten I (2000) Text categorization using compression models. In: Proc. of the IEEE data compression conference, Snowbird, Utah, IEEE Comput Soc p555
Gaussier E, Goutte C, Popat K, Chen F (2002) A hierarchical model for clustering and categorising documents source lecture notes in computer science; Vol. 2291 archive Proceedings of the 24th BCS-IRSG european colloquium on IR research: advances in information retrieval, Glasgow, UK
Gatlin L (1972) Information theory and the living systems. Columbia University Press, columbia
Google Scholar
Gavrilov M, Anguelov D, Indyk P, Motwahl R (2000) Mining the stock market: which measure is best? In: Proc. of the 6th ACM SIGKDD, 2000, Boston, MA, USA
Ge X, Smyth P (2000) Deformable Markov model templates for time-series pattern matching. In: Proc. of the 6th ACM SIGKDD, Boston, MA, pp 81–90
Goldberger A.L, Amaral L, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng CK, Stanley HE (2000) PhysioBank, physioToolkit, and physioNet: components of a new research resource for complex physiologic signals. Circulation 101(23):e215–e220
Google Scholar
Kalpakis K, Gada D, Puttagunta V (2001) Distance measures for effective clustering of ARIMA time-series. In: Proceedings of the 1st IEEE ICDM, San Jose, CA, pp 273-280
Kennel M (2004) Testing time symmetry in time series using data compression dictionaries. Phys Rev E 69; 056208
Keogh E. http://www.cs.ucr.edu/∼eamonn/SIGKDD2004, University of California, Riverside
Keogh E, Folias T (2002) The UCR time series data mining archive. University of California, Riverside CA [http://www.cs.ucr.edu/∼eamonn/TSDMA/index.html]
Google Scholar
Keogh E, Kasetty S (2002) On the need for time series data mining benchmarks: a survey and empirical demonstration. In: Proc. of SIGKDD, Edmonton, Alberta, Canada
Keogh E, Lin J, Truppel W (2003) Clustering of time series subsequences is meaningless: implications for past and future research. In: Proc. of the 3rd IEEE ICDM, Melbourne, FL, pp 115–122
Kit C (1998) A goodness measure for phrase learning via compression with the MDL principle. In: Kruijff-Korbayova I (ed) The ELLSSI-98 student session, Chapt 13, Saarbrueken, pp 175–187
Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H (2001) An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17:149–154
Article Google Scholar
Li M, Chen X, Li X, Ma B, Vitanyi, P (2003) The similarity metric. In: Proc. of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, Baltimore, MD, USA, pp 863–872
Li M, Vitanyi P (1997) An introduction to kolmogorov complexity and its applications, 2nd edn, Springer Verlag, Berlin
Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: Proc. of the 8th ACM SIGMOD workshop on research issues in data mining and knowledge discovery, San Diego, CA
Loewenstern D, Hirsh H, Yianilos P, Noordewier M (1995) DNA sequence classification using compression-based induction, DIMACS Technical Report 95-04
Loewenstern D, Yianilos PN (1999) Significantly lower entropy estimates for natural DNA sequences, J Comput Biol 6(1)
Ma J, Perkins S (2003) Online novelty detection on temporal sequences. In: Proc. international conference on knowledge discovery and data mining, Washington, DC
Mahoney M, Chan P (2005) Learning rules for time series anomaly detection. SensorMiner Tech report (available at [www.interfacecontrol.com/products/sensorMiner/])
Mehta M, Rissanen J, Agrawal R (1995) MDL-based decision tree pruning, In: Proceedings of the first international conference on knowledge discovery and data mining (KDD’95), Montreal, Canada
Needham S, Dowe D(2001) Message length as an effective ockham’s razor in decision tree induction, In: Proc. 8th international workshop on AI and statistics, Key West, FL, USA, pp 253–260
Ortega A, Beferull-Lozano B, Srinivasamurthy N, Xie H (2000) Compression for recognition and content based retrieval. In: Proc. of the European signal processing conference, EUSIPCO’00, Tampere, Finland
Papadimitriou S, Gionis A, Tsaparas P, Väisänen A, Mannila H, Faloutsos C (2005) Parameter-free spatial data mining using MDL, In: Proc of the 5th International Conference on Data Mining (ICDM), Houston, TX, USA
Quinlan JR, Rivest RL (1989) Inferring decision trees using the minimum description length principle. Infor Comput 80:227–248
Article MATH MathSciNet Google Scholar
Ratanamahatana CA, Keogh E (2004) Making time-series classification more accurate using learned constraints. In: Proc. of SIAM international conference on data mining (SDM ’04), Lake Buena Vista, Florida
Rissanen J (1978) Modeling by shortest data description. Automatica, 14:465–471
Article MATH Google Scholar
Salzberg SL (1997) On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Min Knowl Disc 1(3):317–328
Article Google Scholar
Segen J (1990) Graph clustering and model learning by data compression. In: Proc. of the machine learning conference, Austin, TX, USA, pp 93–101
Sculley D, Brodley CE (2006) Compression and machine learning: a new perspective on feature space vectors, In: Proceedings of data compression conference, Snowbird, UT, USA, pp 332–341
Shahabi C, Tian X, Zhao W (2000) TSA-tree: a wavelet-based approach to improve the efficiency of multi-level surprise and trend queries. In: Proc. of the 12th Int’l conference on scientific and statistical database management (SSDBM 2000), Berlin, Germany
Teahan WJ, Wen Y, McNab RJ, Witten IH (2000) A compression-based algorithm for Chinese word segmentation. Comput Linguist 26:375–393
Article Google Scholar
Vlachos M, Hadjieleftheriou M, Gunopulos D, Keogh E (2003) Indexing multi-dimensional time-series with support for multiple distance measures. In: Proc. of the 9th ACM SIGKDD, Washington, DC, USA, pp 216–225
Wallace C, Boulton (1968) An information measure for classification. Comput J 11 (2):185–194
MATH Google Scholar
Yairi T, Kato Y, Hori K (2001) Fault detection by mining association rules from house-keeping data. In: Proc. of Int’l sym. on AI, Robotics and Automation in Space

Download references

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of California, Riverside, CA, 92521, USA
Eamonn Keogh, Stefano Lonardi & Li Wei
Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand
Chotirat Ann Ratanamahatana
Department of Anthropology, University of California, Riverside, CA, 92521, USA
Sang-Hee Lee
Xerox Innovation Group, Xerox Corporation, 800 Phillips Road, MS 128-25E, Webster, New York, 14580-9701, USA
John Handley

Authors

Eamonn Keogh
View author publications
You can also search for this author in PubMed Google Scholar
Stefano Lonardi
View author publications
You can also search for this author in PubMed Google Scholar
Chotirat Ann Ratanamahatana
View author publications
You can also search for this author in PubMed Google Scholar
Li Wei
View author publications
You can also search for this author in PubMed Google Scholar
Sang-Hee Lee
View author publications
You can also search for this author in PubMed Google Scholar
John Handley
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Eamonn Keogh.

Additional information

Responsible editor: Johannes Gehrke

Rights and permissions

Reprints and permissions

About this article

Cite this article

Keogh, E., Lonardi, S., Ratanamahatana, C.A. et al. Compression-based data mining of sequential data. Data Min Knowl Disc 14, 99–129 (2007). https://doi.org/10.1007/s10618-006-0049-3

Download citation

Received: 15 December 2005
Accepted: 02 May 2006
Published: 26 January 2007
Issue Date: February 2007
DOI: https://doi.org/10.1007/s10618-006-0049-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Compression-based data mining of sequential data

Abstract

Access this article

Similar content being viewed by others

Mining and Using Sets of Patterns through Compression

Omen: discovering sequential patterns with reliable prediction delays

Introducing the contrast profile: a novel time series primitive that allows real world classification

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Compression-based data mining of sequential data

Abstract

Access this article

Similar content being viewed by others

Mining and Using Sets of Patterns through Compression

Omen: discovering sequential patterns with reliable prediction delays

Introducing the contrast profile: a novel time series primitive that allows real world classification

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation