Performance Evaluation of Spark SQL Using BigBench

Ivanov, Todor; Beer, Max-Georg

doi:10.1007/978-3-319-49748-8_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10044)


Abstract

In this paper we present the initial results of our work on executing BigBench on Spark. First, we evaluated the scalability of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with those obtained on Hive. Our experiments show that (1) for both Hive and Spark SQL, the BigBench queries scale, on average, better than linearly as the data size increases, and (2) the pure HiveQL queries run faster on Spark SQL than on Hive.
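One way to read the "better than linear" claim is to compare how much the run time grows with how much the data grows. The sketch below is only an illustration of that comparison, not the authors' evaluation method: the function name `scaling_ratio` and all measurements are hypothetical and are not taken from the paper.

```python
def scaling_ratio(base_size_gb, base_time_s, size_gb, time_s):
    """Return run-time growth divided by data-size growth.

    A value below 1.0 means the run time grew more slowly than the data
    (better than linear scaling); 1.0 corresponds to exactly linear scaling.
    """
    if min(base_size_gb, base_time_s, size_gb, time_s) <= 0:
        raise ValueError("sizes and times must be positive")
    data_growth = size_gb / base_size_gb
    time_growth = time_s / base_time_s
    return time_growth / data_growth


# Hypothetical measurements (NOT from the paper):
# (data size in GB, query run time in seconds).
hypothetical_runs = [(100, 200.0), (300, 450.0), (600, 840.0)]

# Use the smallest data size as the baseline for all comparisons.
base_size, base_time = hypothetical_runs[0]
ratios = {
    size: scaling_ratio(base_size, base_time, size, time)
    for size, time in hypothetical_runs[1:]
}

for size, ratio in ratios.items():
    verdict = "better than linear" if ratio < 1.0 else "linear or worse"
    print(f"{size} GB: ratio {ratio:.2f} ({verdict})")
```

With real measurements, the same ratio could be computed per query and then averaged, which matches the "on average" wording of the abstract.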


Acknowledgment

This work has benefited from valuable discussions in the SPEC Research Group’s Big Data Working Group. We would like to thank Tilmann Rabl (University of Toronto), John Poelman (IBM), Bhaskar Gowda (Intel), Yi Yao (Intel), Marten Rosselli, Karsten Tolle, Roberto V. Zicari and Raik Niemann of the Frankfurt Big Data Lab for their valuable feedback. We would like to thank the Fields Institute for supporting our visit to the Sixth Workshop on Big Data Benchmarking at the University of Toronto.

Author information

Authors and Affiliations

Frankfurt Big Data Lab, Goethe University Frankfurt am Main, Frankfurt am Main, Germany
Todor Ivanov & Max-Georg Beer

Corresponding author

Correspondence to Todor Ivanov.

Editor information

Editors and Affiliations

Technical University of Berlin, Berlin, Germany
Tilmann Rabl
Cisco Systems, Inc., San Jose, California, USA
Raghunath Nambiar
University of California at San Diego, La Jolla, California, USA
Chaitanya Baru
Ampool, Inc., Santa Clara, California, USA
Milind Bhandarkar
Oracle Corporation, Redwood Shores, California, USA
Meikel Poess
Indian Institute of Public Health, Hyderabad, India
Saumyadipta Pyne

A. BigBench Queries’ Resource Utilization

See Figs. 6, 7, 8, 9, 10 and 11.

About this paper

Cite this paper

Ivanov, T., Beer, M.-G. (2016). Performance Evaluation of Spark SQL Using BigBench. In: Rabl, T., Nambiar, R., Baru, C., Bhandarkar, M., Poess, M., Pyne, S. (eds) Big Data Benchmarking. WBDB 2015. Lecture Notes in Computer Science, vol 10044. Springer, Cham. https://doi.org/10.1007/978-3-319-49748-8_6

DOI: https://doi.org/10.1007/978-3-319-49748-8_6
Published: 01 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49747-1
Online ISBN: 978-3-319-49748-8
eBook Packages: Computer Science, Computer Science (R0)
