Abstract
In the big data community, Spark plays an important role and is widely used to process interactive queries. Spark employs a query optimizer, called Catalyst, to translate SQL queries into optimized query execution plans. Catalyst contains a number of optimization rules and also supports cost-based optimization. Although query optimization techniques have been well studied in the field of relational database systems, the effectiveness of Catalyst in Spark is still unclear. In this paper, we investigate the effectiveness of rule-based and cost-based optimization in Catalyst, and conduct a set of comparative experiments by varying the data volume and the number of nodes. We find that, even with query optimizations applied, the execution time of most TPC-H queries is only slightly reduced. We also make some interesting observations on Catalyst, which can help the community better understand and improve the query optimizer in Spark.
Acknowledgement
This work is supported by the Key Research and Development Program of Zhejiang Province (No. 2018C01098) and the Natural Science Foundation of Zhejiang Province (No. LY18F020014).
Copyright information
© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Ren, Z. et al. (2019). How Good is Query Optimizer in Spark? In: Gao, H., Wang, X., Yin, Y., Iqbal, M. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 268. Springer, Cham. https://doi.org/10.1007/978-3-030-12981-1_42
DOI: https://doi.org/10.1007/978-3-030-12981-1_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12980-4
Online ISBN: 978-3-030-12981-1
eBook Packages: Computer Science (R0)