Abstract
Hadoop was developed as a solution for running large-scale data-parallel applications in Cloud computing. A Hadoop system can be described in terms of three factors: cluster, workload, and user. Each factor is either heterogeneous or homogeneous, and together they determine the heterogeneity level of the Hadoop system. This paper studies how heterogeneity in each of these factors affects the performance of Hadoop schedulers. Three schedulers that account for different levels of Hadoop heterogeneity are used for the analysis: FIFO, Fair sharing, and COSHH (Classification and Optimization based Scheduler for Heterogeneous Hadoop). Performance issues for Hadoop schedulers are introduced, and experiments are presented to evaluate them. The reported results suggest guidelines for selecting an appropriate scheduler for a given Hadoop system. Finally, the proposed guidelines are evaluated on different Hadoop systems.
Cite this article
Rasooli, A., Down, D.G. Guidelines for Selecting Hadoop Schedulers Based on System Heterogeneity. J Grid Computing 12, 499–519 (2014). https://doi.org/10.1007/s10723-014-9299-2
DOI: https://doi.org/10.1007/s10723-014-9299-2