
Adaptive correlation exploitation in big data query optimization

  • Regular Paper
The VLDB Journal

Abstract

Correlations among data attributes are abundant and inherent in most application domains. These correlations, if managed in systematic and efficient ways, would enable various optimization opportunities. Unfortunately, state-of-the-art techniques are all heavily tailored toward optimizing factors intrinsic to relational databases, e.g., predicate selectivity, random I/O accesses, and secondary indexes, which are mostly not applicable to modern big data infrastructures, e.g., Hadoop and Spark. In this paper, we propose the EXORD\(^+\) system for exploiting the data’s correlations in big data query optimization. EXORD\(^+\) supports two types of correlations: hard (which do not allow exceptions) and soft (which allow exceptions). We introduce a three-phase approach for managing soft correlations: (1) validating and judging the worthiness of soft correlations, (2) selecting and preparing the soft correlations for deployment, and (3) exploiting the correlations in query optimization. EXORD\(^+\) introduces a novel cost-benefit model for adaptively selecting the most beneficial soft correlations given a query workload. We show that this problem is NP-hard and propose a heuristic that solves it efficiently in polynomial time. Moreover, we present incremental maintenance algorithms for efficiently updating the system’s state under data appends and workload changes. The EXORD\(^+\) prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD\(^+\) to achieve more than 10x speedup while introducing minimal storage overheads.
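To make the hard/soft distinction and the selection step more concrete, the following is a minimal Python sketch and not the EXORD\(^+\) implementation. It assumes a correlation is a dependency from one attribute to another, measures exceptions as rows that disagree with the majority value of their group, and replaces the paper's cost-benefit model (not specified in the abstract) with a generic knapsack-style greedy heuristic. The function names, the `(name, benefit, cost)` tuples, and the `budget` parameter are all hypothetical.

```python
from collections import defaultdict


def classify_correlation(rows, lhs, rhs):
    """Classify the dependency lhs -> rhs over a list of dict rows.

    Returns ("hard", 0.0) when every lhs value maps to exactly one rhs value,
    otherwise ("soft", ratio), where ratio is the fraction of rows that
    disagree with the most common rhs value of their lhs group.
    (Assumed definition of "exception", for illustration only.)
    """
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[lhs]][row[rhs]] += 1

    # Rows that do not carry the majority rhs value of their lhs group.
    exceptions = sum(
        sum(group.values()) - max(group.values()) for group in counts.values()
    )
    if exceptions == 0:
        return "hard", 0.0
    return "soft", exceptions / len(rows)


def greedy_select(candidates, budget):
    """Pick correlations by benefit-per-cost ratio until the budget is spent.

    candidates: list of (name, benefit, cost) tuples with cost > 0.
    A generic greedy heuristic under an assumed storage budget; it is a
    stand-in for, not a reproduction of, the paper's selection heuristic.
    """
    chosen, used = [], 0
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for name, _benefit, cost in ranked:
        if used + cost <= budget:
            chosen.append(name)
            used += cost
    return chosen
```

For example, a `zip -> city` dependency with one misspelled city among four rows would be classified as soft with an exception ratio of 0.25, and `greedy_select` would then prefer the candidates offering the highest estimated benefit per unit of storage cost.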


Notes

  1. We use the term “MapReduce” to refer to the system and architecture, and the term “map-reduce” to refer to a single job executed in the system.

  2. Depending on the PiggybackEnabled system-level configuration parameter (Sect. 3.1), the Prep4Deployment task is either piggybacked on the next user’s query or manually triggered as part of the “ASSESS CORRELATIONS ...” command (Sect. 7.1).

  3. If \(\mathcal {V}_{ORD}\) is not empty, then each mapper needs to examine these correlations.

  4. Definition 16 assumes an un-weighted query workload and extends Definition 12. It is straightforward to extend Definition 15 for a weighted query workload in the same manner.

  5. The system maintains a few other configuration parameters that are omitted from the discussion.

  6. The system continuously dumps reports on the queries that are optimized by correlation-based rewriting, the queries that are not optimized, and the number of times each correlation is used to optimize queries. These reports enable system admins to tune the system as desired, or even manually trigger re-assessment using the “ASSESS CORRELATIONS ...” command.


Author information

Correspondence to Mohamed Y. Eltabakh.

Additional information

This project is partially supported by NSF-CRI Grant 1305258.

About this article

Cite this article

Liu, Y., Liu, H., Xiao, D. et al. Adaptive correlation exploitation in big data query optimization. The VLDB Journal 27, 873–898 (2018). https://doi.org/10.1007/s00778-018-0515-8

  • DOI: https://doi.org/10.1007/s00778-018-0515-8
