Similarity join on time series by utilizing a dynamic segmentation index

Wang, Jinhua; Li, Qiuhong; Li, Zhongsheng; Wang, Peng; Wang, Yang; Wang, Wei; Pan, Ningting; Chi, Mingmin

doi:10.1007/s10115-018-1317-4

Regular Paper
Published: 29 January 2019

Volume 61, pages 1517–1546, (2019)

Knowledge and Information Systems

Jinhua Wang¹,
Qiuhong Li ORCID: orcid.org/0000-0003-3575-960X¹,
Zhongsheng Li²,
Peng Wang¹,
Yang Wang¹,
Wei Wang¹,
Ningting Pan¹ &
Mingmin Chi¹

Abstract

Similarity join on time series databases is an essential operation for data analysis applications. Because of the curse of dimensionality, traditional index techniques such as the R-tree and kd-tree are not suitable. In this paper, a dynamic segmentation index (the DSTree) is used to reduce the large comparison cost of the similarity join on time series databases. However, the DSTree is designed for similarity search and only supports bound estimation between a single time series and the batch of time series in a DSTree node. To make the DSTree suitable for the similarity join, tight bounds between nodes are needed to achieve better pruning power; the biggest challenge is that DSTree nodes may have different segmentations. To solve this problem, a segmentation alignment and synopsis evaluation method is proposed that supports bound estimation between DSTree nodes and significantly reduces the time cost by pruning unnecessary comparisons. Moreover, to make the approach I/O efficient, a caching strategy is proposed that takes advantage of both graph partitioning and the locality of the DSTree index. The efficiency and effectiveness of the proposed approaches are verified by experiments on real-life datasets.
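
The pruning idea in the abstract can be illustrated with a much simpler stand-in. The sketch below is not the DSTree method: it assumes every series shares one fixed segmentation and uses the well-known segment-mean (PAA-style) lower bound on Euclidean distance, comparing series pair by pair rather than node by node. The function names, the `bounds` parameter (exclusive right end of each segment), and the threshold `eps` are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def segment_means(series: Sequence[float], bounds: Sequence[int]) -> List[float]:
    """Synopsis of a series: the mean of each segment.

    bounds[k] is the exclusive right end of segment k (assumed layout).
    """
    means, start = [], 0
    for end in bounds:
        means.append(sum(series[start:end]) / (end - start))
        start = end
    return means


def lower_bound(ma: Sequence[float], mb: Sequence[float],
                bounds: Sequence[int]) -> float:
    """Segment-mean lower bound: never exceeds the true Euclidean distance
    when both synopses use the same segmentation."""
    total, start = 0.0, 0
    for a, b, end in zip(ma, mb, bounds):
        total += (end - start) * (a - b) ** 2
        start = end
    return math.sqrt(total)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def similarity_join(left: Sequence[Sequence[float]],
                    right: Sequence[Sequence[float]],
                    eps: float,
                    bounds: Sequence[int]) -> Tuple[List[Tuple[int, int]], int]:
    """Return all index pairs (i, j) with distance <= eps, plus the number
    of pairs discarded by the lower bound without a full comparison."""
    left_syn = [segment_means(s, bounds) for s in left]
    right_syn = [segment_means(s, bounds) for s in right]
    result, pruned = [], 0
    for i, x in enumerate(left):
        for j, y in enumerate(right):
            # If even the lower bound exceeds eps, the pair cannot qualify.
            if lower_bound(left_syn[i], right_syn[j], bounds) > eps:
                pruned += 1
                continue
            if euclidean(x, y) <= eps:
                result.append((i, j))
    return result, pruned
```

Because the bound never overestimates the true distance, pruning on it cannot drop a qualifying pair. The difficulty the abstract highlights is that DSTree nodes may use different segmentations, so a shared `bounds` like the one assumed here is not available without the alignment step the paper proposes.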

Funding

Funding was provided by the National Key Research and Development Program (Grant Nos. 2016YFE0100300, 2016YFB1000700) and the National Natural Science Foundation of China (U1509213).

Author information

Authors and Affiliations

Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China
Jinhua Wang, Qiuhong Li, Peng Wang, Yang Wang, Wei Wang, Ningting Pan & Mingmin Chi
JiangNan Institute of Computing Technology, Wuxi, China
Zhongsheng Li

Corresponding author

Correspondence to Qiuhong Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, J., Li, Q., Li, Z. et al. Similarity join on time series by utilizing a dynamic segmentation index. Knowl Inf Syst 61, 1517–1546 (2019). https://doi.org/10.1007/s10115-018-1317-4

Received: 23 June 2017
Revised: 15 September 2018
Accepted: 29 November 2018
Published: 29 January 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s10115-018-1317-4
