Beyond one billion time series: indexing and mining very large time series collections with \(i\)SAX2+

  • Regular Paper
Knowledge and Information Systems

Abstract

There is an increasingly pressing need, across applications in diverse domains, for techniques that can index and mine very large collections of time series. Examples of such applications come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve hundreds of millions to billions of time series. However, none of the relevant techniques proposed in the literature so far has considered data collections much larger than one million time series. In this paper, we describe \(i\)SAX 2.0 and its improvements, \(i\)SAX 2.0 Clustered and \(i\)SAX2+, three methods designed for indexing and mining truly massive collections of time series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of its kind specifically tailored to a time series index. We show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments to index one billion time series, and experiments in mining massive data from domains as diverse as entomology, DNA and web-scale image collections.

Notes

  1. This temporary storage on disk refers to storing the raw time series data for the period between the time when the time series is processed in order to be indexed, and the time when the raw time series has to be moved to the correct leaf node disk page.

  2. Xylem is the plant vascular tissue responsible for the transport of water and soluble mineral nutrients from the roots throughout the plant.

Acknowledgments

This research was funded by NSF awards 0803410 and 0808770.

Author information

Corresponding author

Correspondence to Themis Palpanas.

About this article

Cite this article

Camerra, A., Shieh, J., Palpanas, T. et al. Beyond one billion time series: indexing and mining very large time series collections with \(i\)SAX2+. Knowl Inf Syst 39, 123–151 (2014). https://doi.org/10.1007/s10115-012-0606-6

  • DOI: https://doi.org/10.1007/s10115-012-0606-6
