How Good is Query Optimizer in Spark?

Ren, Zujie; Yun, Na; Li, Youhuizi; Wan, Jian; Wang, Yuan; Yu, Lihua; Fan, Xinxin

doi:10.1007/978-3-030-12981-1_42

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 268)

Included in the following conference series:

International Conference on Collaborative Computing: Networking, Applications and Worksharing

Abstract

In the big data community, Spark plays an important role and is used to process interactive queries. Spark employs a query optimizer, called Catalyst, to translate SQL queries into optimized query execution plans. Catalyst contains a number of optimization rules and also supports cost-based optimization. Although query optimization techniques have been well studied in the field of relational database systems, the effectiveness of Catalyst in Spark is still unclear. In this paper, we investigate the effectiveness of rule-based and cost-based optimization in Catalyst, and we conduct a set of comparative experiments that vary the data volume and the number of nodes. We find that, even with query optimizations applied, the execution time of most TPC-H queries is only slightly reduced. We also report some interesting observations on Catalyst, which can help the community better understand and improve the query optimizer in Spark.
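
The distinction between rule-based and cost-based optimization drawn in the abstract can be illustrated with a minimal, self-contained sketch. This is not Catalyst's implementation and does not use any Spark API: the classes `Lit`, `Col`, and `Add`, the functions `fold_constants`, `plan_cost`, and `cheapest_join_order`, the row counts, and the single uniform join selectivity are all hypothetical simplifications chosen for illustration. The first part applies one rewrite rule (constant folding) regardless of the data; the second part picks a join order by comparing estimated costs derived from table statistics.

```python
from dataclasses import dataclass
from itertools import permutations


# Rule-based part: a rewrite rule applied to a toy expression tree.
# These node types are hypothetical and only mimic the idea of a plan tree.
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite Add(Lit, Lit) into a single Lit, working bottom-up."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr


# Cost-based part: choose a left-deep join order from row-count estimates.
# The cost model (sum of intermediate result sizes under one uniform
# selectivity) is an assumption made for this sketch.
def plan_cost(order, rows, selectivity):
    size = rows[order[0]]
    cost = 0.0
    for table in order[1:]:
        size = size * rows[table] * selectivity
        cost += size
    return cost


def cheapest_join_order(rows, selectivity):
    """Enumerate all left-deep orders and return one with minimal cost."""
    return min(permutations(rows), key=lambda o: plan_cost(o, rows, selectivity))


if __name__ == "__main__":
    print(fold_constants(Add(Col("x"), Add(Lit(1), Lit(2)))))
    stats = {"lineitem": 6000, "orders": 1500, "nation": 25}
    print(cheapest_join_order(stats, selectivity=0.001))
```

In this sketch the rule fires whenever its pattern matches, while the join-order choice changes with the statistics supplied, which is the sense in which the second kind of optimization depends on data volume. Exhaustive enumeration is only practical for a handful of tables and is used here purely for brevity.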

Acknowledgement

This work is supported by the Key Research and Development Program of Zhejiang Province (No. 2018C01098) and the Natural Science Foundation of Zhejiang Province (No. LY18F020014).

Author information

Authors and Affiliations

School of Computer Science, Hangzhou Dianzi University, Hangzhou, China
Zujie Ren, Na Yun & Youhuizi Li
Department of Software Engineering, Zhejiang University of Science and Technology, Hangzhou, China
Jian Wan
Key Enterprise Research Institute of NetEase Big Data of Zhejiang Province, NetEase Hangzhou Network Co. Ltd., Hangzhou, China
Yuan Wang, Lihua Yu & Xinxin Fan

Corresponding author

Correspondence to Zujie Ren.

Editor information

Editors and Affiliations

Shanghai University, Shanghai, China
Honghao Gao
University of West London, London, UK
Xinheng Wang
Hangzhou Dianzi University, Hangzhou Shi, Zhejiang, China
Yuyu Yin
London South Bank University, London, UK
Muddesar Iqbal

About this paper

Cite this paper

Ren, Z. et al. (2019). How Good is Query Optimizer in Spark? In: Gao, H., Wang, X., Yin, Y., Iqbal, M. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 268. Springer, Cham. https://doi.org/10.1007/978-3-030-12981-1_42

DOI: https://doi.org/10.1007/978-3-030-12981-1_42
Published: 07 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12980-4
Online ISBN: 978-3-030-12981-1
eBook Packages: Computer Science, Computer Science (R0)
