AdBench: A Complete Benchmark for Modern Data Pipelines

Bhandarkar, Milind

doi:10.1007/978-3-319-54334-5_8

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10080)

Included in the following conference series:

Technology Conference on Performance Evaluation and Benchmarking

Abstract

Since the introduction of Apache YARN, which modularly separated resource management and scheduling from the distributed programming frameworks, a multitude of YARN-native computation frameworks have been developed. These frameworks specialize in specific analytics variants. In addition to traditional batch-oriented computations (e.g. MapReduce, Apache Hive [14] and Apache Pig [18]), the Apache Hadoop ecosystem now contains streaming analytics frameworks (e.g. Apache Apex [8]), MPP SQL engines (e.g. Apache Trafodion [20], Apache Impala [15], and Apache HAWQ [12]), OLAP cubing frameworks (e.g. Apache Kylin [17]), frameworks suitable for iterative machine learning (e.g. Apache Spark [19] and Apache Flink [10]), and graph processing (e.g. GraphX). With the emergence of the Hadoop Distributed File System and its various implementations as the preferred method of constructing a data lake, end-to-end data pipelines are increasingly being built on the Hadoop-based data lake platform.

While benchmarks have been developed for individual tasks, such as sort (TPCx-HS [5]) and analytical SQL queries (TPCx-BB [6]), there is a need for a standard benchmark that exercises the various phases of an end-to-end data pipeline in a data lake. In this paper, we propose a benchmark called AdBench, which combines ad serving, streaming analytics on ad-serving logs, streaming ingestion and updates of various data entities, batch-oriented analytics (e.g. for billing), ad-hoc analytical queries, and machine learning for ad targeting. While this benchmark is specific to modern Web or mobile advertising companies and exchanges, its workload characteristics are found in many verticals, such as Internet of Things (IoT), financial services, retail, and healthcare. We also propose a set of metrics to be measured for each phase of the pipeline, and various scale factors of the benchmark.

Notes

1. Referred to in the industry as a “Data Lake”.
2. We distinguish users, who browse through Acme’s website, from customers, who publish advertisements on that website.

References

1. Baru, C., et al.: Discussion of BigBench: a proposed industry standard performance benchmark for big data. In: Nambiar, R., Poess, M. (eds.) TPCTC 2014. LNCS, vol. 8904, pp. 44–63. Springer, Heidelberg (2014)
2. Baru, C., Bhandarkar, M., Nambiar, R., Poess, M., Rabl, T.: Benchmarking big data systems and the bigdata top 100 list. Big Data 1(1), 60–64 (2013)
3. Cask Data, Inc.: Cask Data Application Platform (CDAP), June 2016
4. Standard Performance Evaluation Corporation: SPEC Website, June 2016
5. Transaction Processing Performance Council: TPC Express Benchmark HS, Standard Specification, Version 1.4.0, April 2016
6. Transaction Processing Performance Council: TPC Express Big Bench, Standard Specification, Version 1.1.0, May 2016
7. Transaction Processing Performance Council: TPC Website, June 2016
8. Apache Software Foundation: Apache Apex, June 2016
9. Apache Software Foundation: Apache Cassandra, June 2016
10. Apache Software Foundation: Apache Flink, June 2016
11. Apache Software Foundation: Apache Hadoop, June 2016
12. Apache Software Foundation: Apache HAWQ (incubating), June 2016
13. Apache Software Foundation: Apache HBase, June 2016
14. Apache Software Foundation: Apache Hive, June 2016
15. Apache Software Foundation: Apache Impala, June 2016
16. Apache Software Foundation: Apache Kafka, June 2016
17. Apache Software Foundation: Apache Kylin, June 2016
18. Apache Software Foundation: Apache Pig, June 2016
19. Apache Software Foundation: Apache Spark, June 2016
20. Apache Software Foundation: Apache Trafodion (incubating), June 2016
21. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: BigBench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, pp. 1197–1208. ACM, New York (2013)
22. Huppler, K., Johnson, D.: TPC express – a new path for TPC benchmarks. In: Nambiar, R., Poess, M. (eds.) TPCTC 2013. LNCS, vol. 8391, pp. 48–60. Springer, Heidelberg (2014). doi:10.1007/978-3-319-04936-6_4
23. MongoDB, Inc.: MongoDB, June 2016
24. Rabl, T., Poess, M., Baru, C., Jacobsen, H.-A. (eds.): WBDB 2012. LNCS, vol. 8163. Springer, Heidelberg (2013)
25. Rabl, T., Jacobsen, H.-A., Raghunath, N., Poess, M., Bhandarkar, M., Baru, C. (eds.): WBDB 2013. LNCS, vol. 8585. Springer, Heidelberg (2014)
26. Rabl, T., Sachs, K., Poess, M., Baru, C., Jacobsen, H.-A. (eds.): WBDB 2014. LNCS, vol. 8991. Springer, Heidelberg (2015)
27. Yahoo Storm Engineering Team: Benchmarking Streaming Computation Engines at Yahoo!, December 2015

Author information

Authors and Affiliations

Ampool Inc., Santa Clara, USA
Milind Bhandarkar

Corresponding author

Correspondence to Milind Bhandarkar.

Editor information

Editors and Affiliations

Cisco Systems, Inc., San Jose, California, USA
Raghunath Nambiar
Oracle Corporation, Redwood City, California, USA
Meikel Poess

About this paper

Cite this paper

Bhandarkar, M. (2017). AdBench: A Complete Benchmark for Modern Data Pipelines. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking. Traditional - Big Data - Internet of Things. TPCTC 2016. Lecture Notes in Computer Science, vol 10080. Springer, Cham. https://doi.org/10.1007/978-3-319-54334-5_8

DOI: https://doi.org/10.1007/978-3-319-54334-5_8
Published: 18 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54333-8
Online ISBN: 978-3-319-54334-5
eBook Packages: Computer Science, Computer Science (R0)
