A Bit Level Representation for Time Series Data Mining with Shape Based Similarity

Bagnall, Anthony; Ratanamahatana, Chotirat “Ann”; Keogh, Eamonn; Lonardi, Stefano; Janacek, Gareth

doi:10.1007/s10618-005-0028-0

Published: 12 May 2006

Volume 13, pages 11–40, (2006)

Data Mining and Knowledge Discovery

Abstract

Clipping is the process of transforming a real-valued series into a sequence of bits representing whether each data point is above or below the average. In this paper, we argue that clipping is a useful and flexible transformation for the exploratory analysis of large time-dependent data sets. We demonstrate how time series stored as bits can be very efficiently compressed and manipulated and that, under some assumptions, the discriminatory power with clipped series is asymptotically equivalent to that achieved with the raw data. Unlike other transformations, clipped series can be compared directly to the raw data series. We show that this means we can form a tight lower bounding metric for Euclidean and Dynamic Time Warping distance and hence efficiently query by content. Clipped data can be used in conjunction with a host of algorithms and statistical tests that naturally follow from the binary nature of the data. A series of experiments illustrates how clipped series can be used in increasingly complex ways to achieve better results than other popular representations. The usefulness of the proposed representation is demonstrated by the fact that the results with clipped data are consistently better than those achieved with a Wavelet or Discrete Fourier Transformation at the same compression ratio for both clustering and query by content. The flexibility of the representation is shown by the fact that we can take advantage of a variable Run Length Encoding of clipped series to define an approximation of the Kolmogorov complexity and hence perform Kolmogorov based clustering.
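As an illustration of the transformation described above, the following minimal Python sketch clips a short series against its mean and then summarises the resulting bits as run lengths. It is not the authors' implementation: the function names `clip` and `run_lengths` are hypothetical, the treatment of values exactly equal to the mean (mapped to 0) is an assumption, and the plain list of run lengths stands in for the variable Run Length Encoding mentioned in the abstract, whose exact form is not given here.

```python
from itertools import groupby


def clip(series):
    """Map each value to 1 if it is strictly above the series mean, else 0.

    Assumption: values equal to the mean are mapped to 0.
    """
    mean = sum(series) / len(series)
    return [1 if x > mean else 0 for x in series]


def run_lengths(bits):
    """Return the first bit and the lengths of consecutive runs of equal bits.

    Assumption: a plain run-length list, used only to illustrate the idea.
    """
    runs = [len(list(group)) for _, group in groupby(bits)]
    return bits[0], runs


# Illustrative data only; the mean of this series is 2.0.
series = [2.0, 3.5, 1.0, 0.5, 4.0, 4.5, 0.0, 0.5]
bits = clip(series)
first_bit, runs = run_lengths(bits)
print(bits)             # [0, 1, 0, 0, 1, 1, 0, 0]
print(first_bit, runs)  # 0 [1, 1, 2, 2, 2]
```

Because the bits alternate between runs, the first bit together with the run lengths is enough to rebuild the clipped series, which is why long runs compress well.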

Author information

Authors and Affiliations

School of Computing Sciences, University of East Anglia, Norwich, UK
Anthony Bagnall & Gareth Janacek
Department of Computer Engineering, Chulalongkorn University, Chulalongkorn, Thailand
Chotirat “Ann” Ratanamahatana
Department of Computer Science and Engineering, University of California, Riverside, USA
Eamonn Keogh & Stefano Lonardi

Corresponding author

Correspondence to Anthony Bagnall.

About this article

Cite this article

Bagnall, A., Ratanamahatana, C.A., Keogh, E. et al. A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. Data Min Knowl Disc 13, 11–40 (2006). https://doi.org/10.1007/s10618-005-0028-0

Received: 12 May 2005
Accepted: 31 October 2005
Published: 12 May 2006
Issue Date: July 2006
DOI: https://doi.org/10.1007/s10618-005-0028-0
