Skip to main content
Log in

Analytical Performance Models for MapReduce Workloads

  • Published:
International Journal of Parallel Programming Aims and scope Submit manuscript

Abstract

MapReduce is a currently popular programming model to support parallel computations on large datasets. Among the several existing MapReduce implementations, Hadoop has attracted a lot of attention from both industry and research. In a Hadoop job, map and reduce tasks coordinate to produce a solution to the input problem, exhibiting precedence constraints and synchronization delays that are characteristic of a pipeline communication between maps (producers) and reduces (consumers). We here address the challenge of designing analytical models to estimate the performance of MapReduce workloads, notably Hadoop workloads, focusing particularly on the intra-job pipeline parallelism between map and reduce tasks belonging to the same job. We propose a hierarchical model that combines a precedence graph model and a queuing network model to capture the intra-job synchronization constraints. We first show how to build a precedence graph that represents the dependencies among multiple tasks of the same job. We then apply it jointly with an approximate Mean Value Analysis (aMVA) solution to predict mean job response time, throughput and resource utilization. We validate our solution against a queuing network simulator and a real setup in various scenarios, finding very close agreement in both cases. In particular, our model produces estimates of average job response time that deviate from measurements of a real setup by less than 15 %.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Apache Software Foundation, Powered by Hadoop. URL http://wiki.apache.org/hadoop/PoweredBy. Access date: 1 July 2012 (2012)

  2. Apache Software Foundation, Official Apache Hadoop Website. URL http://hadoop.apache.org/. Accessed date: 1 July 2012 (2012)

  3. Berlińska J., Drozdowski M.: Scheduling divisible MapReduce computations. J. Parallel Distrib. Comput. 71(3), 450–459 (2011)

    Article  Google Scholar 

  4. Chen, Y., Ganapathi, A., Griffith R., Katz, R.: The case for evaluating MapReduce performance using workload suites. In: Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Singapore, pp. 390–399 (2011)

  5. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears R.: MapReduce Online Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI), San Jose, California, pp. 21–35 (2010)

  6. Dean, J., Ghemawat, S., MapReduce : Simplified data processing on large clusters. In: Proceedings of Operating Systems Design and Implementation (OSDI), San Francisco, California, pp. 137–150 (2004)

  7. Dean J., Ghemawat S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  8. Ganapathi, A.: Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning. Technical Report UCB/EECS-2009-181. EECS Department, University of California, Berkeley (2009)

  9. Ganapathi, A., Kuno, H., Dayal, U., Wiener, J., Fox, A., Jordan, M., Patterson, D. : Predicting multiple metrics for queries: better decisions enabled by machine learning. In: Proceedings of the 2009 IEEE International Conference on Data Engineering (ICDE), Shanghai, China, pp. 592–603 (2009)

  10. Herodotou, H.: Hadoop Performance Models. Technical Report CS-2011-05. Computer Science Department, Duke University. URL http://arxiv.org/abs/1106.0940 (2011)

  11. Jain R.: The Art of Computer Systems Performance Analysis—Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, London (1991)

    MATH  Google Scholar 

  12. Jiang D.R., Ooi B.C., Shi L., Wu S.: The performance of MapReduce: an in-depth study. Proc. VLDB Endow 3(1–2), 472–483 (2010)

    Google Scholar 

  13. Jonkers H.: Queueing Models of Parallel Applications: The Glamis Methodology, Computer Performance Evaluation: Modeling Techniques & Tools (LNCS 794), pp. 123–138. Springer, Berlin (1994)

    Book  Google Scholar 

  14. Kim S., Won J., Han H., Eom H., Yeom H.Y.: Improving hadoop performance in intercloud environments ACM SIGMETRICS. Perform. Eval. Rev. 39(3), 107–109 (2011)

    Article  Google Scholar 

  15. Krevat, E., Shiran, T., Anderson, E., Tucek, J., Wylie, J.J. , Ganger, G.R.: Applying Performance Models to Understand Data-intensive Computing Efficiency. Technical Report CMU-PDL-10-108. Carnegie Mellon University, Pittsburgh (2010)

  16. Kruskal C.P., Weiss A.: Allocating independent subtasks on parallel processors. IEEE Trans. Softw. Eng. 11(10), 1001–1016 (1985)

    Article  MATH  Google Scholar 

  17. Lavenberg S., Reiser M.: Stationary state probabilities at arrival instants for closed queueing networks with multiple types of customers. J. Appl. Probab. 17(4), 1048–1061 (1980)

    Article  MathSciNet  MATH  Google Scholar 

  18. Lee K.H., Lee Y.J., Choi H., Chung Y.D., Moon B.: Parallel data processing with MapReduce: a survey. ACM SIGMOD Record J. 40(4), 11–20 (2011)

    Article  Google Scholar 

  19. Liang D.R., Tripathi S. K.: On performance prediction of parallel computations with precedent constraints. IEEE Trans. Parallel Distrib. Syst. 11(5), 491–508 (2000)

    Article  Google Scholar 

  20. Little J.: A proof for the queuing formula: L = λ W. Oper. Res. 9(3), 383–387 (1961)

    Article  MathSciNet  MATH  Google Scholar 

  21. Mak V.W., Lundstrom S.F.: Predicting performance of parallel computations. IEEE Trans. Parallel Distrib. Syst. 1(3), 257–260 (1990)

    Article  Google Scholar 

  22. Menasce, D., Dowdy, L., Almeida, V.: Performance by Design: Computer Capacity Planning By Example. Prentice Hall PTR (2004)

  23. Morton, K., Balazinska, M., Grossman, D.: ParaTimer: a progress indicator for MapReduce DAGs. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD), Indianapolis, Indiana, pp. 507–518 (2010)

  24. Pavlo, A., Paulson, E., Rasin, A., Abadi, D., DeWitt, D., Madden, S., Stonebraker, M.: A Comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD), Providence, Rhode Island, pp. 165–178 (2009)

  25. Reiser M., Lavenberg S.S.: Mean-value analysis of closed multichain queuing networks. J. ACM 27(2), 313–322 (1980)

    Article  MathSciNet  MATH  Google Scholar 

  26. Salza, S., Lavenberg, S.S.: Approximating response time distributions in closed queueing network models of computer performance. In: Proceedings Performance, North Holland, Amsterdam, pp. 133–145 (1981)

  27. Thomasian A., Bay P.F.: Analytic queueing network models for parallel processing of task systems. IEEE Trans. Comput. 35(12), 1045–1054 (1986)

    Article  Google Scholar 

  28. Trivedi K.S.: Probability and Statistics with Reliability, Queuing and Computer Science Applications. Prentice Hall PTR, Upper Saddle River (1882)

    Google Scholar 

  29. Varki, E.: Mean value technique for closed fork-join networks. In: Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Atlanta, Georgia, pp. 103–112 (1999)

  30. Vianna, E.,Comarela, G., Pontes, T., Almeida, J., Almeida, V., Wilkinson, K., Kuno, H., Dayal, U.: Modeling the performance of the Hadoop online prototype. In: Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Vitória, Brazil, pp. 152–159 (2011)

  31. Wang, G., Butt, A.R., Pandey, P., Gupta, K.: A simulation approach to evaluating design decisions in MapReduce setups. In: Proceedings of the IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), Imperial College London, UK, pp. 1–11 (2009)

  32. Wang, G., Butt, A.R., Pandey, P., Gupta, K.: Using realistic simulation for performance analysis of MapReduce setups. In: Proceedings of the 1st ACM Workshop on Large-Scale System and Application Performance (LSAP), Munich, Germany, pp. 19–26 (2009)

  33. Weng N., Wolf T.: Analytic modeling of network processors for parallel workload mapping. ACM Trans. Embed. Comput. Syst. 8(3), 18:1–18:29 (2009)

    Article  Google Scholar 

  34. White T.: Hadoop—The Definitive Guide: Storage and Analysis at Internet Scale. 2nd edn. O’Reilly Media, Sebastopol (2011)

    Google Scholar 

  35. Yang, H.C., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-Reduce-Merge: simplified relational data processing on LargeClusters. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD), Beijing, China, pp. 1029–1040 (2007)

  36. Yang, X., Sun, J.: An Analytical performance model of MapReduce. In: Proceedings of the 2011 IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), Beijing, China, pp. 306–310 (2011)

  37. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving MapReduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI), San Diego, California, pp. 29–42 (2008)

  38. Zahorjan, J.: The Approximate Solution of Large Queueing Network Models, PhD. Thesis, University of Toronto, Canada (1980)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Emanuel Vianna.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Vianna, E., Comarela, G., Pontes, T. et al. Analytical Performance Models for MapReduce Workloads. Int J Parallel Prog 41, 495–525 (2013). https://doi.org/10.1007/s10766-012-0227-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10766-012-0227-4

Keywords

Navigation