AdBench: A Complete Benchmark for Modern Data Pipelines

  • Conference paper

Performance Evaluation and Benchmarking. Traditional - Big Data - Internet of Things (TPCTC 2016)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10080)

Abstract

Since the introduction of Apache YARN, which modularly separated resource management and scheduling from the distributed programming frameworks, a multitude of YARN-native computation frameworks have been developed. These frameworks specialize in specific analytics variants. In addition to traditional batch-oriented computations (e.g. MapReduce, Apache Hive [14] and Apache Pig [18]), the Apache Hadoop ecosystem now contains streaming analytics frameworks (e.g. Apache Apex [8]), MPP SQL engines (e.g. Apache Trafodion [20], Apache Impala [15], and Apache HAWQ [12]), OLAP cubing frameworks (e.g. Apache Kylin [17]), frameworks suitable for iterative machine learning (e.g. Apache Spark [19] and Apache Flink [10]), and graph processing frameworks (e.g. GraphX). With the emergence of the Hadoop Distributed File System and its various implementations as the preferred method of constructing a data lake, end-to-end data pipelines are increasingly being built on the Hadoop-based data lake platform.

While benchmarks have been developed for individual tasks, such as Sort (TPCx-HS [5]) and analytical SQL queries (TPCx-BB [6]), there is a need for a standard benchmark that exercises the various phases of an end-to-end data pipeline in a data lake. In this paper, we propose a benchmark called AdBench, which combines ad serving, streaming analytics on ad-serving logs, streaming ingestion and updates of various data entities, batch-oriented analytics (e.g. for billing), ad hoc analytical queries, and machine learning for ad targeting. While this benchmark is specific to modern Web or mobile advertising companies and exchanges, its workload characteristics are found in many verticals, such as Internet of Things (IoT), financial services, retail, and healthcare. We also propose a set of metrics to be measured for each phase of the pipeline, and various scale factors of the benchmark.

Notes

  1. Referred to in the industry as a “Data Lake”.

  2. We distinguish users, who browse through Acme’s website, from customers, who publish advertisements on that website.

References

  1. Baru, C., et al.: Discussion of BigBench: a proposed industry standard performance benchmark for big data. In: Nambiar, R., Poess, M. (eds.) TPCTC 2014. LNCS, vol. 8904, pp. 44–63. Springer, Heidelberg (2014)

  2. Baru, C., Bhandarkar, M., Nambiar, R., Poess, M., Rabl, T.: Benchmarking big data systems and the bigdata top 100 list. Big Data 1(1), 60–64 (2013)

  3. Cask Data, Inc., Cask Data Application Platform (CDAP), June 2016

  4. Standard Performance Evaluation Corporation. SPEC Website, June 2016

  5. Transaction Processing Performance Council. TPC Express Benchmark HS, Standard Specification, Version 1.4.0, April 2016

  6. Transaction Processing Performance Council. TPC Express Big Bench, Standard Specification, Version 1.1.0, May 2016

  7. Transaction Processing Performance Council. TPC Website, June 2016

  8. Apache Software Foundation. Apache Apex, June 2016

  9. Apache Software Foundation. Apache Cassandra, June 2016

  10. Apache Software Foundation. Apache Flink, June 2016

  11. Apache Software Foundation. Apache Hadoop, June 2016

  12. Apache Software Foundation. Apache HAWQ (incubating), June 2016

  13. Apache Software Foundation. Apache HBase, June 2016

  14. Apache Software Foundation. Apache Hive, June 2016

  15. Apache Software Foundation. Apache Impala, June 2016

  16. Apache Software Foundation. Apache Kafka, June 2016

  17. Apache Software Foundation. Apache Kylin, June 2016

  18. Apache Software Foundation. Apache Pig, June 2016

  19. Apache Software Foundation. Apache Spark, June 2016

  20. Apache Software Foundation. Apache Trafodion (incubating), June 2016

  21. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: BigBench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, pp. 1197–1208. ACM, New York (2013)

  22. Huppler, K., Johnson, D.: TPC express – a new path for TPC benchmarks. In: Nambiar, R., Poess, M. (eds.) TPCTC 2013. LNCS, vol. 8391, pp. 48–60. Springer, Heidelberg (2014). doi:10.1007/978-3-319-04936-6_4

  23. MongoDB, Inc., MongoDB, June 2016

  24. Rabl, T., Poess, M., Baru, C., Jacobsen, H.-A. (eds.): WBDB 2012. LNCS, vol. 8163. Springer, Heidelberg (2013)

  25. Rabl, T., Jacobsen, H.-A., Raghunath, N., Poess, M., Bhandarkar, M., Baru, C. (eds.): WBDB 2013. LNCS, vol. 8585. Springer, Heidelberg (2014)

  26. Rabl, T., Sachs, K., Poess, M., Baru, C., Jacobsen, H.-A. (eds.): WBDB 2014. LNCS, vol. 8991. Springer, Heidelberg (2015)

  27. Yahoo Storm Engineering Team. Benchmarking Streaming Computation Engines at Yahoo! December 2015

Author information

Corresponding author

Correspondence to Milind Bhandarkar.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Bhandarkar, M. (2017). AdBench: A Complete Benchmark for Modern Data Pipelines. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking. Traditional - Big Data - Internet of Things. TPCTC 2016. Lecture Notes in Computer Science, vol 10080. Springer, Cham. https://doi.org/10.1007/978-3-319-54334-5_8

  • DOI: https://doi.org/10.1007/978-3-319-54334-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54333-8

  • Online ISBN: 978-3-319-54334-5

  • eBook Packages: Computer Science, Computer Science (R0)
