AQUA+: Query Optimization for Hybrid Database-MapReduce System

  • Regular Paper
  • Knowledge and Information Systems

Abstract

MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. However, many existing applications maintain their data in a distributed database, and it is costly to export those data into the storage system of MapReduce (normally a distributed file system). Moreover, compared to MapReduce, a database is equipped with many state-of-the-art techniques, such as indexes and a query optimizer. Therefore, a hybrid Database-MapReduce system that inherits the advantages of both systems is preferred. In this paper, we propose AQUA+, a query optimizer tailored for the hybrid system. AQUA+ extends our previous system, AQUA. It generates a plan that adaptively assigns operators to the database engine and the MapReduce engine to optimize performance. The intuition is to exploit the indexes, co-partitioning and other features provided by the database as much as possible and to reduce the data volume processed by MapReduce. Due to the complexity of query optimization, AQUA+ introduces a novel tuning technique, learning to optimize. In particular, two neural networks are trained to predict the cost and to refine the query plan, respectively. We train them on our log of real query processing. Experiments carried out on our in-house cluster confirm the effectiveness of our query optimizer.
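The idea of assigning operators to one of two engines can be illustrated with a toy sketch. It is not the AQUA+ algorithm: it assumes a linear chain of operators, hard-coded per-engine cost estimates (the paper instead learns costs with a neural network), and a fixed cost for shipping an operator's output to the other engine. The names `Operator`, `assign_engines`, and all cost figures are hypothetical.

```python
from dataclasses import dataclass

ENGINES = ("db", "mr")  # database engine, MapReduce engine


@dataclass(frozen=True)
class Operator:
    name: str
    cost: dict        # engine -> estimated execution cost (hypothetical units)
    transfer: float   # cost of moving this operator's output to the other engine


def assign_engines(ops):
    """Return (total_cost, engines) minimising cost over a linear operator chain."""
    # best[e] = cheapest (cost, assignment) so far whose last operator runs on e
    best = {e: (ops[0].cost[e], [e]) for e in ENGINES}
    for prev, op in zip(ops, ops[1:]):
        nxt = {}
        for e in ENGINES:
            candidates = []
            for p in ENGINES:
                cost, path = best[p]
                # Pay the transfer cost only when the engine changes.
                move = prev.transfer if p != e else 0.0
                candidates.append((cost + move + op.cost[e], path + [e]))
            nxt[e] = min(candidates, key=lambda c: c[0])
        best = nxt
    return min(best.values(), key=lambda c: c[0])


# Hypothetical plan: the database is cheap for the indexed scan and the
# co-partitioned join, MapReduce is cheap for the final aggregation.
ops = [
    Operator("index_scan", {"db": 1.0, "mr": 8.0}, transfer=2.0),
    Operator("co_partitioned_join", {"db": 3.0, "mr": 6.0}, transfer=1.0),
    Operator("aggregate", {"db": 9.0, "mr": 2.0}, transfer=0.5),
]
total, plan = assign_engines(ops)
print(total, plan)  # 7.0 ['db', 'db', 'mr']
```

With these made-up numbers the sketch keeps the scan and join in the database and hands only the reduced intermediate result to MapReduce, which mirrors the intuition stated in the abstract.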

Acknowledgements

This work was supported by the Key Research Program of Zhejiang Province [grant number 2021C01109]; the Zhejiang Provincial Natural Science Foundation [grant number LZ21F020007]; and the Grid State Foundation [grant number 5211XT190033].

Author information

Corresponding author

Correspondence to Sai Wu.

About this article

Cite this article

Pang, Z., Wu, S., Huang, H. et al. AQUA+: Query Optimization for Hybrid Database-MapReduce System. Knowl Inf Syst 63, 905–938 (2021). https://doi.org/10.1007/s10115-020-01542-4

  • DOI: https://doi.org/10.1007/s10115-020-01542-4
