ABSTRACT
BigBench, standardized as TPCx-BB, is a popular application benchmark that targets Big Data storage and processing systems. BigBench V2 addresses some of the limitations of BigBench by introducing a new, simplified data model, semi-structured web logs in JSON format, and new queries that mandate late binding. However, it still covers only batch processing workloads and does not address the velocity characteristic of Big Data. This work extends the BigBench V2 benchmark with a data streaming component that simulates typical statistical and predictive analytics queries in a retail business scenario. Our approach is to preserve the existing BigBench design and introduce a new streaming component that supports two data streaming modes: active and passive. In active mode, data stream generation and processing happen in parallel, whereas in passive mode, the data stream is generated in full before the actual stream processing starts. The stream workload consists of five queries inspired by the existing 30 BigBench queries. To validate the proposed streaming extension, the two streaming modes were implemented and tested using Kafka and Spark Streaming. The experimental results demonstrate the feasibility of our benchmark design. Finally, we outline design challenges and future plans for improving the proposed BigBench extension.
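The difference between the two streaming modes can be illustrated with a minimal sketch. It is not the implementation described in this work, which uses Kafka and Spark Streaming; here an in-process queue and a thread stand in for the message broker and the stream processor, and all names (`generate_events`, `process`, `run_active`, `run_passive`) as well as the toy click-count query are hypothetical.

```python
import queue
import threading

SENTINEL = None  # assumption: None marks the end of the stream


def generate_events(n):
    """Hypothetical stand-in for the data generator: yields n click events."""
    for i in range(n):
        yield {"user": i % 3, "item": i % 5}


def process(events):
    """Toy streaming query (an assumption): count clicks per item."""
    counts = {}
    for event in events:
        counts[event["item"]] = counts.get(event["item"], 0) + 1
    return counts


def run_passive(n):
    # Passive mode: the whole stream is generated first, then processed.
    buffered = list(generate_events(n))
    return process(buffered)


def run_active(n):
    # Active mode: a producer thread generates events while the
    # consumer processes them concurrently through a bounded queue.
    q = queue.Queue(maxsize=16)

    def producer():
        for event in generate_events(n):
            q.put(event)
        q.put(SENTINEL)

    t = threading.Thread(target=producer)
    t.start()
    # iter(q.get, SENTINEL) keeps calling q.get() until the sentinel arrives.
    result = process(iter(q.get, SENTINEL))
    t.join()
    return result
```

Both functions compute the same result for the same input; they differ only in whether generation overlaps with processing, which is the distinction that matters when measuring throughput and latency.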
Index Terms
- Adding Velocity to BigBench