The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems

Jia, Zhen; Zhou, Runlin; Zhu, Chunge; Wang, Lei; Gao, Wanling; Shi, Yingjie; Zhan, Jianfeng; Zhang, Lixin

doi:10.1007/978-3-642-53974-9_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8163)

Abstract

We live in an era of big data, and big data applications are becoming increasingly pervasive. How to benchmark data center computer systems running big data applications (big data systems, for short) is a hot topic. In this paper, we focus on measuring the performance impact of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, our experiments yield two major results. First, data scale has a significant impact on the performance of big data systems, so big data benchmarks must provide scalable volumes of data sets. Second, even though all four applications use simple algorithms, their performance trends differ as data scale increases; hence benchmarking big data systems must consider not only a variety of data sets but also a variety of applications.
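The kind of measurement the abstract describes, running several applications at increasing data scales and comparing their performance trends, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two workloads (`sort_workload`, `wordcount_workload`), the synthetic data generator, and the throughput metric are hypothetical stand-ins, not the paper's four applications, data sets, or experimental setup.

```python
import random
import time
from collections import Counter


def make_dataset(n_records, seed=0):
    """Generate n_records synthetic words (a stand-in for a real data set)."""
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(1000)]
    return [rng.choice(vocab) for _ in range(n_records)]


def sort_workload(records):
    """Hypothetical workload 1: sort all records."""
    return sorted(records)


def wordcount_workload(records):
    """Hypothetical workload 2: count occurrences of each record."""
    return Counter(records)


WORKLOADS = {"sort": sort_workload, "wordcount": wordcount_workload}


def run_benchmark(scales, workloads=WORKLOADS):
    """Return {workload_name: [(n_records, records_per_second), ...]}."""
    results = {name: [] for name in workloads}
    for n in scales:
        data = make_dataset(n)
        for name, fn in workloads.items():
            start = time.perf_counter()
            fn(data)
            elapsed = time.perf_counter() - start
            # Guard against a zero reading on very coarse timers.
            throughput = n / elapsed if elapsed > 0 else float("inf")
            results[name].append((n, throughput))
    return results


if __name__ == "__main__":
    # Each workload is measured at every scale so the trends can be compared.
    for name, series in run_benchmark([10_000, 100_000, 1_000_000]).items():
        for n, tput in series:
            print(f"{name:10s} n={n:>9,d}  {tput:,.0f} records/s")
```

Reporting throughput per workload at each scale, rather than a single number at one scale, is what lets differing trends across applications show up at all.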

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China
Zhen Jia, Lei Wang, Wanling Gao, Yingjie Shi, Jianfeng Zhan & Lixin Zhang
University of Chinese Academy of Sciences, China
Zhen Jia, Lei Wang & Wanling Gao
National Computer Network Emergency Response Technical Team Coordination Center of China, China
Runlin Zhou & Chunge Zhu

Editor information

Editors and Affiliations

Department of Electric and Computer Science, University of Toronto, 10 King’s College Road, SFB 540, M5S 3G4, Toronto, ON, Canada
Tilmann Rabl & Hans-Arno Jacobsen
Server Technologies, Oracle Corporation, 500 Oracle Parkway, 94065, Redwood Shores, CA, USA
Meikel Poess
Supercomputer Center, University of California San Diego, 9500 Gilman Drive, 92093-0505, La Jolla, CA, USA
Chaitanya Baru

Cite this paper

Jia, Z. et al. (2014). The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems. In: Rabl, T., Poess, M., Baru, C., Jacobsen, H.-A. (eds) Specifying Big Data Benchmarks. WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53974-9_5

DOI: https://doi.org/10.1007/978-3-642-53974-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53973-2
Online ISBN: 978-3-642-53974-9
eBook Packages: Computer Science, Computer Science (R0)
