AQUA+: Query Optimization for Hybrid Database-MapReduce System

Pang, Zhifei; Wu, Sai; Huang, Haichao; Hong, Zhouzhenyan; Xie, Yuqing

doi:10.1007/s10115-020-01542-4

Regular Paper
Published: 05 February 2021

Volume 63, pages 905–938, (2021)

Knowledge and Information Systems

Zhifei Pang¹,³,
Sai Wu¹,
Haichao Huang²,
Zhouzhenyan Hong¹ &
Yuqing Xie²

Abstract

MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. However, many existing applications maintain their data in a distributed database, and exporting those data into the storage system of MapReduce (normally a distributed file system) is costly. Moreover, compared to MapReduce, a database is equipped with many state-of-the-art techniques, such as indexes and a query optimizer. Therefore, a hybrid Database-MapReduce system that inherits the advantages of both systems is preferred. In this paper, we propose AQUA+, a query optimizer tailored for such a hybrid system. AQUA+ extends our previous system, AQUA. It generates a plan that adaptively assigns operators to the database engine and the MapReduce engine to optimize performance. The intuition is to exploit the indexes, co-partitioning, and other features provided by the database as much as possible and to reduce the data volume processed by MapReduce. Because query optimization is complex, AQUA+ introduces a novel tuning technique, learning to optimize. In particular, two neural networks are trained to predict cost and to refine the query plan, respectively. We train them on our log of real query processing. Experiments carried out on our in-house cluster confirm the effectiveness of our query optimizer.
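
The intuition of assigning operators to two engines can be illustrated with a toy example. The sketch below is not the AQUA+ algorithm, which relies on trained neural networks. It is a minimal illustration that assumes a linear operator pipeline, data that start in the database, and hand-picked cost numbers. All names (`Operator`, `assign_engines`, `ROWS_PER_COST_UNIT`) and all cost values are hypothetical.

```python
from dataclasses import dataclass

# Assumed transfer model: shipping 100 rows to MapReduce costs one unit.
ROWS_PER_COST_UNIT = 100


@dataclass(frozen=True)
class Operator:
    name: str
    db_cost: float    # hypothetical cost of running in the database engine
    mr_cost: float    # hypothetical cost of running as a MapReduce job
    output_rows: int  # estimated number of rows the operator produces


def assign_engines(pipeline, input_rows):
    """Pick the split point k that minimizes total cost.

    Operators before k run in the database, the rest run in MapReduce.
    Data are shipped once, at the split, and never moved back.
    """
    best = None
    for k in range(len(pipeline) + 1):
        db_part, mr_part = pipeline[:k], pipeline[k:]
        shipped = 0
        if mr_part:
            # Rows that must leave the database before MapReduce starts.
            shipped = db_part[-1].output_rows if db_part else input_rows
        cost = (sum(op.db_cost for op in db_part)
                + shipped / ROWS_PER_COST_UNIT
                + sum(op.mr_cost for op in mr_part))
        plan = ["db"] * k + ["mr"] * len(mr_part)
        if best is None or cost < best[0]:
            best = (cost, plan)
    return best


# Illustrative numbers only: the index and co-partitioning make the first two
# operators cheap in the database, while the final step is cheaper in MapReduce.
pipeline = [
    Operator("index_scan_filter", db_cost=5, mr_cost=50, output_rows=1000),
    Operator("co_partitioned_join", db_cost=20, mr_cost=60, output_rows=2000),
    Operator("aggregate_udf", db_cost=200, mr_cost=30, output_rows=10),
]
cost, plan = assign_engines(pipeline, input_rows=1_000_000)
print(cost, plan)
```

With these assumed numbers, the cheapest plan keeps the filter and the join in the database and ships only the 2000 joined rows to MapReduce, which mirrors the stated goal of reducing the data volume that MapReduce has to process.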

Acknowledgements

This work was supported by the Key Research Program of Zhejiang Province [grant number 2021C01109]; the Zhejiang Provincial Natural Science Foundation [grant number LZ21F020007]; and the Grid State Foundation [grant number 5211XT190033].

Author information

Authors and Affiliations

Zhejiang University, Hangzhou, China
Zhifei Pang, Sai Wu & Zhouzhenyan Hong
State Grid Zhejiang Electric Power Company Information & Telecommunication Branch, Hangzhou, China
Haichao Huang & Yuqing Xie
Institute of Computing Innovation, Zhejiang University, Hangzhou, China
Zhifei Pang

Corresponding author

Correspondence to Sai Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pang, Z., Wu, S., Huang, H. et al. AQUA+: Query Optimization for Hybrid Database-MapReduce System. Knowl Inf Syst 63, 905–938 (2021). https://doi.org/10.1007/s10115-020-01542-4

Received: 25 January 2020
Revised: 25 December 2020
Accepted: 30 December 2020
Published: 05 February 2021
Issue Date: April 2021
DOI: https://doi.org/10.1007/s10115-020-01542-4
