MapReduce Workload Modeling with Statistical Approach

Journal of Grid Computing

Abstract

Large-scale, data-intensive cloud computing with the MapReduce framework is becoming pervasive in the core business of many academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful realization of the MapReduce framework. While MapReduce is easy to use, efficient, and reliable for data-intensive computations, the large number of configuration parameters in Hadoop makes it unexpectedly challenging to run diverse workloads on a Hadoop cluster effectively. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that still performs poorly, either because they do not know how these configurations influence performance or because they are not even aware that the configurations exist. There is a pressing need for comprehensive analysis and performance modeling to ease MapReduce application development and guide performance optimization under different Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. We apply principal component analysis and cluster analysis to 45 different metrics to derive relationships between workload characteristics and the corresponding performance under different Hadoop configurations. We also construct regression models that attempt to predict the performance of various workloads under different Hadoop configurations. Our analysis reveals several non-intuitive relationships between workload characteristics and performance, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.
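As a rough illustration of the kind of pipeline the abstract describes (principal component analysis followed by cluster analysis over per-run metrics), the sketch below standardizes a small metric table, extracts the leading principal direction, and groups runs by their projected score. Everything specific here is an assumption made for illustration: the three metric names and all values are synthetic, power iteration stands in for a full PCA, and a one-dimensional 2-means stands in for whatever clustering method the paper actually uses. The paper works with 45 metrics; three are used here only to keep the example readable.

```python
import math

# Hypothetical per-run metrics (synthetic values, NOT taken from the paper):
# [CPU utilization, disk read MB/s, shuffle volume GB]
runs = [
    [0.90, 20.0, 1.0],
    [0.85, 25.0, 1.2],
    [0.88, 22.0, 0.9],
    [0.30, 90.0, 6.0],
    [0.35, 85.0, 5.5],
    [0.32, 95.0, 6.2],
]


def standardize(rows):
    """Scale every metric (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def first_component(rows, iters=200):
    """Leading principal direction via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
           for i in range(d)]
    # Arbitrary non-uniform start; assumed not orthogonal to the leading direction.
    v = [float(i + 1) for i in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def two_means(scores, iters=20):
    """Split 1-D scores into two groups with a basic k-means (k = 2)."""
    lo, hi = min(scores), max(scores)
    labels = [0] * len(scores)
    if lo == hi:  # all scores identical: nothing to separate
        return labels
    for _ in range(iters):
        labels = [0 if abs(s - lo) <= abs(s - hi) else 1 for s in scores]
        g0 = [s for s, l in zip(scores, labels) if l == 0]
        g1 = [s for s, l in zip(scores, labels) if l == 1]
        lo, hi = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels


z = standardize(runs)
pc1 = first_component(z)
# Project each standardized run onto the leading principal direction.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in z]
labels = two_means(scores)
print(labels)
```

In this toy table the CPU-heavy runs and the I/O-heavy runs fall on opposite sides of the first component, so the two groups separate cleanly. Real workload data would likely need several components and a more careful clustering choice.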

Author information

Corresponding author

Correspondence to Hailong Yang.

About this article

Cite this article

Yang, H., Luan, Z., Li, W. et al. MapReduce Workload Modeling with Statistical Approach. J Grid Computing 10, 279–310 (2012). https://doi.org/10.1007/s10723-011-9201-4

  • DOI: https://doi.org/10.1007/s10723-011-9201-4
