Abstract
Approximate similarity search over data series is a fundamental operation that underlies almost all analytical tasks. To speed up this operation, the prevalent approach is to construct indexes directly on the data series objects. This approach suffers from very high construction time and storage cost due to the inherent complexity of indexing high-dimensional data series objects. We instead propose a new approach that leverages correlations between the high-dimensional data series objects and the (often simple) partitioning attribute(s) in distributed data series repositories. Our proposed infrastructure, called PARROT, discovers, assesses, and exploits such correlations for similarity query optimization. PARROT addresses several critical challenges, including the high dimensionality of the data series objects, the softness (uncertainty) of correlations, correlation granularity, and the lack of a proper measure for assessing correlation strength in big data series. We present scalable solutions to each of these challenges, including pattern-level indexing, exception handling strategies for soft correlations, and a new entropy-based measure for assessing the strength of correlations and judging their potential effectiveness. The PARROT query engine efficiently supports approximate kNN similarity queries by leveraging the PARROT index. A PARROT prototype is implemented on Apache Spark. Extensive experiments on real and synthetic datasets demonstrate that PARROT has substantially lower index construction costs, smaller storage overhead, and better performance and accuracy for processing similarity queries than alternative state-of-the-art solutions.

Notes
A sorted list of all entries from the global index is not necessary. Since we only need a subset, a min-heap is used to incrementally keep or purge entries.
Mean Average Precision (MAP) is a widely used accuracy measure for centralized systems; it captures and compares the order of the items in the answer sets. We do not report it explicitly here because, on distributed platforms, the answer is generated in a distributed fashion and therefore carries no inherent order; the final results are instead globally sorted. In this case, MAP becomes equivalent to Recall.
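The first note above, on keeping only a subset of global index entries with a min-heap, can be illustrated with a minimal sketch. This is not the PARROT implementation: the function name `top_k_entries`, the `(score, partition_id)` entry layout, and the convention that a larger score is better are all assumptions made for illustration. If entries were ranked by distance instead, the score could be negated.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical entry layout: (score, partition_id); larger score = better (assumption).
Entry = Tuple[float, str]

def top_k_entries(entries: Iterable[Entry], k: int) -> List[Entry]:
    """Keep the k highest-scoring entries without sorting the full input."""
    if k <= 0:
        return []
    heap: List[Entry] = []  # min-heap: heap[0] is the weakest entry currently kept
    for entry in entries:
        if len(heap) < k:
            # Still below capacity: keep the entry.
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Better than the weakest kept entry: purge that one, keep this one.
            heapq.heapreplace(heap, entry)
        # Otherwise the entry is discarded immediately.
    # Only the k survivors are sorted, never the whole input.
    return sorted(heap, reverse=True)

entries = [(0.2, "p1"), (0.9, "p2"), (0.5, "p3"), (0.7, "p4"), (0.1, "p5")]
best = top_k_entries(entries, 3)
```

With n entries, this takes O(n log k) time and O(k) memory, whereas fully sorting the entries would take O(n log n) time and O(n) memory.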
About this article
Cite this article
Zhang, L., Alghamdi, N., Zhang, H. et al. PARROT: pattern-based correlation exploitation in big partitioned data series. The VLDB Journal 32, 665–688 (2023). https://doi.org/10.1007/s00778-022-00767-9
DOI: https://doi.org/10.1007/s00778-022-00767-9