Abstract
When data centers employ the common and economical practice of upgrading subsets of nodes incrementally, rather than replacing or upgrading all nodes at once, they end up with clusters whose nodes have non-uniform processing capability, which we also call performance-heterogeneity. Popular frameworks supporting the effective MapReduce programming model for Big Data applications do not flexibly adapt to these environments. Instead, existing MapReduce frameworks, including Hadoop, typically divide data evenly among worker nodes, thereby inducing the well-known problem of stragglers on slower nodes. Our alternative MapReduce framework, called MARLA, divides each worker’s labor into sub-tasks, delays the binding of data to worker processes, and thereby enables applications to run faster in performance-heterogeneous environments. This approach does introduce overhead, however. We explore and characterize the opportunity for performance gains, and identify when the benefits outweigh the costs. Our results suggest that frameworks should support finer-grained sub-tasking and dynamic data partitioning when running on some performance-heterogeneous clusters. Blindly taking this approach in homogeneous clusters can slow applications down. Our study further suggests the opportunity for cluster managers to build performance-heterogeneous clusters by design, if they also run MapReduce frameworks that can exploit them.
This work was supported in part by NSF grant CNS-0958501.
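The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not MARLA's implementation; it is a minimal simulation under assumed conditions: each node has a fixed relative speed, work is perfectly divisible, and every sub-task costs a fixed scheduling overhead. The function names, the `speeds` values, and the `overhead_per_subtask` parameter are all hypothetical and chosen only for illustration.

```python
import heapq


def static_even_split(total_units, speeds):
    """Completion time when data is divided evenly among workers up front.

    The job ends only when the slowest worker finishes its fixed share.
    """
    share = total_units / len(speeds)
    return max(share / s for s in speeds)


def dynamic_subtasks(total_units, speeds, subtasks_per_worker,
                     overhead_per_subtask=0.0):
    """Completion time when work is cut into sub-tasks bound to workers late.

    Each sub-task goes to whichever worker becomes free first, so faster
    workers naturally process more of the data. Every sub-task pays a
    fixed (assumed) overhead.
    """
    n_subtasks = subtasks_per_worker * len(speeds)
    unit = total_units / n_subtasks
    # Min-heap of (time at which the worker is free, worker index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_subtasks):
        t, i = heapq.heappop(free_at)
        done = t + overhead_per_subtask + unit / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


if __name__ == "__main__":
    hetero = [1.0, 1.0, 2.0, 4.0]   # hypothetical relative node speeds
    homo = [1.0, 1.0, 1.0, 1.0]
    print("heterogeneous, even split:", static_even_split(80, hetero))
    print("heterogeneous, sub-tasks: ", dynamic_subtasks(80, hetero, 4))
    print("homogeneous, even split:  ", static_even_split(80, homo))
    print("homogeneous, sub-tasks:   ", dynamic_subtasks(80, homo, 4, 0.5))
```

With these hypothetical numbers, the even split on the heterogeneous cluster finishes only when the slowest node does (20 time units), whereas pull-based sub-tasks finish in 10. On the homogeneous cluster, the assumed per-sub-task overhead makes sub-tasking slower than the even split, which mirrors the abstract's caution about applying the approach blindly.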
Notes
1. MARLA stands for “MApReduce with adaptive Load balancing for heterogeneous and Load imbalAnced clusters.”
2. We do not use the Fastest node configuration for this set of experiments.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Hartog, J., DelValle, R., Govindaraju, M., Lewis, M.J. (2015). Performance Analysis of Adapting a MapReduce Framework to Dynamically Accommodate Heterogeneity. In: Hameurlain, A., Küng, J., Wagner, R., Sakr, S., Wang, L., Zomaya, A. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XX. Lecture Notes in Computer Science, vol 9070. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46703-9_5
DOI: https://doi.org/10.1007/978-3-662-46703-9_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46702-2
Online ISBN: 978-3-662-46703-9
eBook Packages: Computer Science (R0)