Abstract
Similarity join on time series databases is an essential operation for data analysis applications. Because of the curse of dimensionality, traditional index techniques such as the R-tree and kd-tree are not suitable. In this paper, a dynamic segmentation index (the DSTree) is used to reduce the high comparison cost of similarity joins on time series databases. However, the DSTree is designed for similarity search and only supports bound estimation between a single time series and the batch of time series in a DSTree node. To make the DSTree suitable for similarity joins, tight bounds between pairs of nodes are needed to achieve better pruning power; the biggest challenge is that DSTree nodes may have different segmentations. To address this problem, a segmentation alignment and synopsis evaluation method is proposed that supports bound estimation between DSTree nodes and significantly reduces time cost by pruning unnecessary comparisons. Moreover, to make the approach I/O efficient, a caching strategy is proposed that takes advantage of both graph partitioning and the locality of the DSTree index. The efficiency and effectiveness of the proposed approaches are verified by experiments on real-life datasets.
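To make the idea of node-to-node bounds under different segmentations concrete, the following is a minimal sketch and not the method of the paper. It assumes a simplified synopsis in which each node stores, per segment, only a range (lo, hi) of the segment means of its series (the abstract does not specify the synopsis contents). The two segmentations are aligned by keeping only their shared breakpoints, the mean ranges are merged by length-weighted averaging, and a lower bound on the Euclidean distance between any pair of series drawn from the two nodes follows from the fact that the squared distance over a segment of length L is at least L times the squared difference of the segment means. All function and parameter names are hypothetical.

```python
import math

def common_breakpoints(ends_a, ends_b):
    """Segment end positions shared by both segmentations (sorted)."""
    return sorted(set(ends_a) & set(ends_b))

def coarsen(ends, ranges, target_ends):
    """Merge per-segment mean ranges onto a coarser segmentation.

    ends        -- exclusive end index of each segment; last entry is the series length
    ranges      -- (lo, hi) bounds on the mean of each segment (assumed synopsis)
    target_ends -- subset of `ends` that defines the coarser segmentation
    """
    merged = []
    targets = iter(target_ends)
    current = next(targets, None)
    seg_start = 0      # start of the current fine segment
    block_start = 0    # start of the current coarse segment
    lo_sum = hi_sum = 0.0
    for end, (lo, hi) in zip(ends, ranges):
        length = end - seg_start
        lo_sum += length * lo
        hi_sum += length * hi
        seg_start = end
        if end == current:
            total = end - block_start
            # The mean over the merged segment is a length-weighted average
            # of the sub-segment means, so its bounds average the same way.
            merged.append((lo_sum / total, hi_sum / total))
            lo_sum = hi_sum = 0.0
            block_start = end
            current = next(targets, None)
    return merged

def node_lower_bound(ends_a, ranges_a, ends_b, ranges_b):
    """Lower bound on the Euclidean distance between any series pair
    taken from node A and node B, using only the mean ranges."""
    shared = common_breakpoints(ends_a, ends_b)
    a = coarsen(ends_a, ranges_a, shared)
    b = coarsen(ends_b, ranges_b, shared)
    total = 0.0
    start = 0
    for end, (alo, ahi), (blo, bhi) in zip(shared, a, b):
        gap = max(0.0, alo - bhi, blo - ahi)  # 0 when the ranges overlap
        total += (end - start) * gap * gap
        start = end
    return math.sqrt(total)

def can_prune(bound, epsilon):
    """A node pair can be skipped when even the lower bound exceeds epsilon."""
    return bound > epsilon
```

In a join with distance threshold epsilon, a pair of nodes whose lower bound exceeds epsilon cannot contain any qualifying pair, so all comparisons between their series can be skipped. Aligning on shared breakpoints keeps the bound valid but loosens it; a tighter alignment would need richer per-segment statistics than the mean ranges assumed here.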














Funding
Funding was provided by the National Key Research and Development Program (Grant Nos. 2016YFE0100300, 2016YFB1000700) and the National Natural Science Foundation of China (Grant No. U1509213).
Cite this article
Wang, J., Li, Q., Li, Z. et al. Similarity join on time series by utilizing a dynamic segmentation index. Knowl Inf Syst 61, 1517–1546 (2019). https://doi.org/10.1007/s10115-018-1317-4
DOI: https://doi.org/10.1007/s10115-018-1317-4