Beyond one billion time series: indexing and mining very large time series collections with $i$SAX2+

Camerra, Alessandro; Shieh, Jin; Palpanas, Themis; Rakthanmanon, Thanawin; Keogh, Eamonn

doi:10.1007/s10115-012-0606-6

Regular Paper
Published: 16 February 2013

Volume 39, pages 123–151 (2014)

Knowledge and Information Systems

Alessandro Camerra¹,
Jin Shieh²,
Themis Palpanas¹,
Thanawin Rakthanmanon³ &
Eamonn Keogh²

Abstract

Applications in diverse domains have an increasingly pressing need for techniques that can index and mine very large collections of time series. Examples of such applications come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve numbers of time series on the order of hundreds of millions to billions. However, none of the relevant techniques proposed in the literature so far has considered data collections much larger than one million time series. In this paper, we describe $i$SAX 2.0 and its improvements, $i$SAX 2.0 Clustered and $i$SAX2+, three methods designed for indexing and mining truly massive collections of time series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of its kind specifically tailored to a time series index. We show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments to index one billion time series, and experiments in mining massive data from domains as diverse as entomology, DNA and web-scale image collections.

Notes

This temporary on-disk storage holds the raw time series data from the time a time series is processed for indexing until the time it has to be moved to the correct leaf node disk page.
Xylem is plant sap responsible for the transport of water and soluble mineral nutrients from the roots throughout the plant.

Acknowledgments

This research was funded by NSF awards 0803410 and 0808770.

Author information

Authors and Affiliations

University of Trento, Trento, Italy
Alessandro Camerra & Themis Palpanas
University of California, Riverside, CA, USA
Jin Shieh & Eamonn Keogh
Kasetsart University, Bangkok, Thailand
Thanawin Rakthanmanon

Corresponding author

Correspondence to Themis Palpanas.

About this article

Cite this article

Camerra, A., Shieh, J., Palpanas, T. et al. Beyond one billion time series: indexing and mining very large time series collections with $i$SAX2+. Knowl Inf Syst 39, 123–151 (2014). https://doi.org/10.1007/s10115-012-0606-6

Received: 23 March 2012
Revised: 23 September 2012
Accepted: 28 December 2012
Published: 16 February 2013
Issue Date: April 2014
DOI: https://doi.org/10.1007/s10115-012-0606-6
