A BigBench Implementation in the Hadoop Ecosystem

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8585)

Abstract

BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich set of complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, with the queries formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.

In this paper, we present an alternative implementation of BigBench for the Hadoop ecosystem. All 30 BigBench queries were realized using Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We describe the design choices we made and show a proof-of-concept evaluation.

Notes

  1. http://hadoop.apache.org/
  2. http://hive.apache.org/
  3. https://mahout.apache.org/
  4. http://www.nltk.org/
  5. https://issues.apache.org/jira/browse/PIG-2397
  6. https://issues.apache.org/jira/browse/HIVE-600
  7. http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
  8. https://amplab.cs.berkeley.edu/benchmark/
  9. http://msrg.org

Author information

Corresponding author

Correspondence to Tilmann Rabl.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Chowdhury, B., Rabl, T., Saadatpanah, P., Du, J., Jacobsen, HA. (2014). A BigBench Implementation in the Hadoop Ecosystem. In: Rabl, T., Raghunath, N., Poess, M., Bhandarkar, M., Jacobsen, HA., Baru, C. (eds) Advancing Big Data Benchmarks. WBDB 2013. Lecture Notes in Computer Science, vol 8585. Springer, Cham. https://doi.org/10.1007/978-3-319-10596-3_1

  • DOI: https://doi.org/10.1007/978-3-319-10596-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10595-6

  • Online ISBN: 978-3-319-10596-3

  • eBook Packages: Computer Science, Computer Science (R0)
