Adaptive correlation exploitation in big data query optimization

Liu, Yuchen; Liu, Hai; Xiao, Dongqing; Eltabakh, Mohamed Y.

doi:10.1007/s00778-018-0515-8

Regular Paper
Published: 28 July 2018

Volume 27, pages 873–898 (2018)

The VLDB Journal

Yuchen Liu¹, Hai Liu¹, Dongqing Xiao¹ & Mohamed Y. Eltabakh¹ (ORCID: orcid.org/0000-0002-6344-8246)

Abstract

Correlations among data attributes are abundant and inherent in most application domains. These correlations, if managed in systematic and efficient ways, would enable various optimization opportunities. Unfortunately, state-of-the-art techniques are heavily tailored toward optimizing factors intrinsic to relational databases, e.g., predicate selectivity, random I/O accesses, and secondary indexes, which are mostly not applicable to modern big data infrastructures, e.g., Hadoop and Spark. In this paper, we propose the EXORD\(^+\) system for exploiting the data’s correlations in big data query optimization. EXORD\(^+\) supports two types of correlations: hard (which does not allow for exceptions) and soft (which allows for exceptions). We introduce a three-phase approach for managing soft correlations: (1) validating and judging the worthiness of soft correlations, (2) selecting and preparing the soft correlations for deployment, and (3) exploiting the correlations in query optimization. EXORD\(^+\) introduces a novel cost-benefit model for adaptively selecting the most beneficial soft correlations given a query workload. We show that this selection problem is NP-Hard and propose a heuristic that solves it efficiently in polynomial time. Moreover, we present incremental maintenance algorithms for efficiently updating the system’s state under data appends and workload changes. The EXORD\(^+\) prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD\(^+\) to achieve more than a 10x speedup while introducing minimal storage overhead.
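
To make the selection step more concrete, the following is a minimal sketch of a generic greedy heuristic for choosing soft correlations under a storage budget. It is not the EXORD\(^+\) cost-benefit model, which the abstract does not spell out. The names `SoftCorrelation`, `benefit`, `storage_cost`, `select_correlations`, and `budget` are all hypothetical, and the sketch assumes each correlation has a single precomputed benefit estimate for the workload and a single storage cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftCorrelation:
    # All fields are illustrative assumptions, not taken from the paper.
    name: str            # hypothetical label, e.g. "zip -> city"
    benefit: float       # assumed estimate of workload savings (arbitrary units)
    storage_cost: float  # assumed overhead of keeping the correlation's exceptions


def select_correlations(candidates, budget):
    """Greedily pick correlations by benefit per unit of storage.

    This is a standard knapsack-style heuristic: it runs in O(n log n)
    time but does not guarantee an optimal selection.
    """
    # Ignore candidates that cannot help or whose ratio is undefined.
    usable = [c for c in candidates if c.benefit > 0 and c.storage_cost > 0]
    ranked = sorted(usable, key=lambda c: c.benefit / c.storage_cost, reverse=True)

    chosen, used = [], 0.0
    for c in ranked:
        # Take the candidate only if it still fits in the remaining budget.
        if used + c.storage_cost <= budget:
            chosen.append(c)
            used += c.storage_cost
    return chosen


if __name__ == "__main__":
    candidates = [
        SoftCorrelation("A", benefit=10.0, storage_cost=5.0),
        SoftCorrelation("B", benefit=9.0, storage_cost=3.0),
        SoftCorrelation("C", benefit=4.0, storage_cost=4.0),
    ]
    for c in select_correlations(candidates, budget=8.0):
        print(c.name)
```

Ranking by ratio rather than by raw benefit favors correlations that save a lot of work per unit of stored exceptions, which is one plausible reading of "most beneficial" under a storage constraint. A real system would also have to re-estimate the benefits whenever the query workload changes.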

Notes

We use the term “MapReduce” to refer to the system and architecture, and the term “map-reduce” to refer to a single job executed in the system.
Depending on the PiggybackEnabled system-level configuration parameter (Sect. 3.1), the Prep4Deployment task is either piggybacked on the next user’s query or manually triggered as part of the “ASSESS CORRELATIONS ...” command (Sect. 7.1).
If \(\mathcal {V}_{ORD}\) is not empty, then each mapper needs to examine these correlations.
Definition 16 assumes an unweighted query workload and extends Definition 12. It is straightforward to extend Definition 15 for a weighted query workload in the same manner.
The system maintains a few other configuration parameters that are omitted from the discussion.
The system continuously generates reports on the queries that are optimized by correlation-based rewriting, the queries that are not optimized, and the number of times each correlation is used to optimize queries. These reports enable system admins to tune the system as desired, or even manually trigger re-assessment using the “ASSESS CORRELATIONS ...” command.

Author information

Authors and Affiliations

Computer Science Department, Worcester Polytechnic Institute (WPI), 100 Institute Rd., Worcester, MA, 01609, USA
Yuchen Liu, Hai Liu, Dongqing Xiao & Mohamed Y. Eltabakh

Corresponding author

Correspondence to Mohamed Y. Eltabakh.

Additional information

This project is partially supported by NSF-CRI Grant 1305258.

About this article

Cite this article

Liu, Y., Liu, H., Xiao, D. et al. Adaptive correlation exploitation in big data query optimization. The VLDB Journal 27, 873–898 (2018). https://doi.org/10.1007/s00778-018-0515-8

Received: 08 January 2018
Revised: 11 July 2018
Accepted: 14 July 2018
Published: 28 July 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s00778-018-0515-8
