Abstract
Benchmarks are important tools for evaluating systems, as long as their results are transparent and reproducible and the benchmarks are conducted with due diligence. Today, many SQL-on-Hadoop vendors use the data generators and queries of existing TPC benchmarks but fail to adhere to the benchmark rules, producing results that are not transparent. As the SQL-on-Hadoop movement continues to gain traction, it is important to bring some order to this “wild west” of benchmarking. First, new rules and policies should be defined to satisfy the demands of the new generation of SQL systems. The new benchmark evaluation schemes should be inexpensive, effective, and open enough to embrace the variety of SQL-on-Hadoop systems and their corresponding vendors. Second, adhering to the new standards requires industry commitment and collaboration. In this paper, we discuss the problems we observe in current benchmarking practices and present our proposal for bringing standardization to the SQL-on-Hadoop space.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Floratou, A., Özcan, F., Schiefer, B. (2015). Benchmarking SQL-on-Hadoop Systems: TPC or Not TPC? In: Rabl, T., Sachs, K., Poess, M., Baru, C., Jacobsen, H.-A. (eds) Big Data Benchmarking. WBDB 2014. Lecture Notes in Computer Science, vol 8991. Springer, Cham. https://doi.org/10.1007/978-3-319-20233-4_7
DOI: https://doi.org/10.1007/978-3-319-20233-4_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20232-7
Online ISBN: 978-3-319-20233-4
eBook Packages: Computer Science, Computer Science (R0)