Skip to main content

How Good is Query Optimizer in Spark?

  • Conference paper
  • First Online:
Book cover Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2018)

Abstract

In the big data community, Spark plays an important role and is used to process interactive queries. Spark employs a query optimizer, called Catalyst, to interpret SQL queries to optimized query execution plans. Catalyst contains a number of optimization rules and supports cost-based optimization. Although query optimization techniques have been well studied in the field of relational database systems, the effectiveness of Catalyst in Spark is still unclear. In this paper, we investigated the effectiveness of rule-based and cost-based optimization in Catalyst, meanwhile, we obtained a set of comparative experiments by varying the data volume and the number of nodes. It is found that even when applied query optimizations, the execution time of most TPC-H queries were slightly reduced. Some interesting observations were made on Catalyst, which can enable the community to have a better understanding and improvement of the query optimizer in Spark.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Taylor, R.C.: An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform. 11, S1 (2010)

    Article  MathSciNet  Google Scholar 

  2. Melnik, S., et al.: Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow. 3(1–2), 330–339 (2010)

    Article  Google Scholar 

  3. Ducarme, P., Rahman, M., Brasseur, R.: IMPALA: a simple restraint field to simulate the biological membrane in molecular structure studies. Proteins Struct. Funct. Bioinform. 30(4), 357–371 (1998)

    Article  Google Scholar 

  4. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: USENIX Conference on Hot Topics in Cloud Computing, p. 10 (2010)

    Google Scholar 

  5. Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on apache spark. Int. J. Data Sci. Anal. 1(3–4), 145–164 (2016)

    Article  Google Scholar 

  6. Armbrust, M., et al.: Spark SQL: relational data processing in spark. In: SIGMOD 2015, pp. 1383–1394. ACM (2015)

    Google Scholar 

  7. Zaharia, M., et al.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)

    Article  Google Scholar 

  8. Ma, J., et al.: Logical query optimization for cloudera impala system. J. Syst. Softw. 125, 35–46 (2017)

    Article  Google Scholar 

  9. Naacke, H., Curé, O., Amann, B.: SPARQL query processing with apache spark. arXiv preprint arXiv:1604.08903 (2016)

  10. Graefe, G.: The cascades framework for query optimization. IEEE Data Eng. Bull. 18(3), 19–29 (1995)

    Google Scholar 

  11. Esawi, A.M.K., Ashby, M.F.: Cost-based ranking for manufacturing process selection. In: Batoz, J.L., Chedmail, P., Cognet, G., Fortin, C. (eds.) Integrated Design and Manufacturing in Mechanical Engineering, pp. 603–610. Springer, Dordrecht (1999). https://doi.org/10.1007/978-94-015-9198-0_74

    Chapter  Google Scholar 

  12. Wu, J.-M., Zhou, J.: Research of optimization rule of SQL based on oracle database. J. Shaanxi Univ. Technol. (2013)

    Google Scholar 

  13. Antoshenkov, G., Ziauddin, M.: Query processing and optimization in oracle RDB. VLDB J. Int. J. Very Large Data Bases 5(4), 229–237 (1996)

    Article  Google Scholar 

  14. Chaudhuri, S.: An overview of query optimization in relational systems. In: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 34–43. ACM (1998)

    Google Scholar 

  15. Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc. VLDB Endow. 4(11), 1111–1122 (2011)

    Google Scholar 

  16. Chiba, T., Onodera, T.: Workload characterization and optimization of TPC-H queries on apache spark. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 112–121. IEEE (2016)

    Google Scholar 

  17. Liang, W., Zheng, Y.: TPC-H analysis and test tool design. Comput. Eng. Appl. (2007)

    Google Scholar 

  18. Transaction processing performance council. http://www.tpc.org

  19. Ioannidis, Y.E.: Query optimization. ACM Comput. Surv. (CSUR) 28(1), 121–123 (1996)

    Article  Google Scholar 

  20. Roy, P., Seshadri, S., Sudarshan, S., Bhobe, S.: Efficient and extensible algorithms for multi query optimization. ACM SIGMOD Rec. 29, 249–260 (2000)

    Article  Google Scholar 

  21. Graefe, G., DeWitt, D.J.: The EXODUS Optimizer Generator, vol. 16. ACM (1987)

    Google Scholar 

  22. Barbas, P.M.: Database query optimization, 21 January 2014. US Patent 8,635,206

    Google Scholar 

  23. Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., Neumann, T.: How good are query optimizers, really? Proc. VLDB Endow. 9(3), 204–215 (2015)

    Article  Google Scholar 

  24. Kocsis, Z.A., Drake, J.H., Carson, D., Swan, J.: Automatic improvement of apache spark queries using semantics-preserving program reduction. In: Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion, pp. 1141–1146. ACM (2016)

    Google Scholar 

  25. Liu, C.: Research on SparkSQL query optimization based on cost model (2016)

    Google Scholar 

  26. Zhang, L.: Research on query analysis and optimization based on spark system (2016)

    Google Scholar 

  27. Wang, Z.: Spark issue. https://issues.apache.org/jira/browse/SPARK-16026

Download references

Acknowledgement

This work is supported by Key Research and Development Program of Zhejiang Province (No. 2018C01098), and the Natural Science Foundation of Zhejiang Province (NO. LY18F020014).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Zujie Ren .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Ren, Z. et al. (2019). How Good is Query Optimizer in Spark?. In: Gao, H., Wang, X., Yin, Y., Iqbal, M. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 268. Springer, Cham. https://doi.org/10.1007/978-3-030-12981-1_42

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-12981-1_42

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-12980-4

  • Online ISBN: 978-3-030-12981-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics