MapReduce Workload Modeling with Statistical Approach

Yang, Hailong; Luan, Zhongzhi; Li, Wenjun; Qian, Depei

doi:10.1007/s10723-011-9201-4

Published: 04 January 2012

Volume 10, pages 279–310, (2012)

Journal of Grid Computing

Abstract

Large-scale data-intensive cloud computing with the MapReduce framework is becoming pervasive in the core business of many academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful realization of the MapReduce framework. While MapReduce is easy to use, efficient, and reliable for data-intensive computations, the large number of configuration parameters in Hadoop makes it unexpectedly challenging to run varied workloads effectively on a Hadoop cluster. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that still performs poorly, either because they do not know how these configurations influence performance or because they are not even aware that the configurations exist. There is a pressing need for comprehensive analysis and performance modeling to ease MapReduce application development and guide performance optimization under different Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. We apply principal component analysis and cluster analysis to 45 different metrics to derive relationships between workload characteristics and the corresponding performance under different Hadoop configurations. We also construct regression models that predict the performance of various workloads under different Hadoop configurations. Our analysis reveals several non-intuitive relationships between workload characteristics and performance, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.
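To make the pipeline described in the abstract more concrete, the following is a minimal sketch of the general idea: reduce a table of workload metrics with principal component analysis, then regress performance on the reduced representation. It is not the authors' implementation. The three metric columns, the six runs, and the runtimes are invented for illustration, the cluster analysis step is omitted, and ordinary least squares on the first component stands in for whatever regression model the paper actually uses.

```python
import math

# Hypothetical data: each row is one workload run, each column one metric
# (e.g. input size, reduce-task count, sort-buffer size). All values invented.
metrics = [
    [1.0, 2.0, 100.0],
    [2.0, 4.0, 120.0],
    [4.0, 4.0, 150.0],
    [8.0, 8.0, 200.0],
    [16.0, 8.0, 260.0],
    [32.0, 16.0, 400.0],
]
runtime_s = [35.0, 48.0, 70.0, 110.0, 170.0, 300.0]  # invented runtimes


def standardize(rows):
    """Scale every column to zero mean and unit sample variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iters=200):
    """Power iteration: approximates the leading eigenvector of cov."""
    v = [1.0] * len(cov)
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


z = standardize(metrics)
pc1 = first_component(covariance(z))

# Project each run onto the first principal component.
scores = [sum(x * c for x, c in zip(row, pc1)) for row in z]

# Regress runtime on the single component score.
slope, intercept = fit_line(scores, runtime_s)
predicted = [slope * s + intercept for s in scores]

mean_rt = sum(runtime_s) / len(runtime_s)
ss_res = sum((y - p) ** 2 for y, p in zip(runtime_s, predicted))
ss_tot = sum((y - mean_rt) ** 2 for y in runtime_s)
r2 = 1.0 - ss_res / ss_tot

print("first component loadings:", [round(c, 3) for c in pc1])
print("R^2 on the training runs:", round(r2, 3))
```

With 45 metrics rather than three, one would normally keep several components and validate the regression on held-out runs; the single-component fit here only illustrates how dimensionality reduction and performance prediction connect.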

Author information

Authors and Affiliations

Sino-German Joint Software Institute, The State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, China
Hailong Yang, Zhongzhi Luan, Wenjun Li & Depei Qian

Corresponding author

Correspondence to Hailong Yang.

About this article

Cite this article

Yang, H., Luan, Z., Li, W. et al. MapReduce Workload Modeling with Statistical Approach. J Grid Computing 10, 279–310 (2012). https://doi.org/10.1007/s10723-011-9201-4

Received: 08 August 2011
Accepted: 21 December 2011
Published: 04 January 2012
Issue Date: June 2012
DOI: https://doi.org/10.1007/s10723-011-9201-4
