Abstract
Big data has overwhelmed industry and research sectors. Reliable decision making on such data is challenging and requires cost-effective big data processing tools. Hadoop MapReduce is widely used to store and process huge volumes of data in a distributed environment. However, because setting up an on-premise Hadoop cluster demands a large capital investment and specialized expertise, big data users seek cloud-based MapReduce services over the Internet. Typically, MapReduce on a cluster of virtual machines (VMs) is offered as a service on a pay-per-use basis. The VMs of a MapReduce virtual cluster reside on different physical machines and are co-located with other non-MapReduce VMs. They therefore share I/O resources such as disk and network bandwidth, which leads to congestion because most MapReduce jobs are disk and network intensive. In particular, the shuffle phase of the MapReduce execution sequence consumes a large share of network bandwidth in a multi-tenant environment, which increases job latency and bandwidth consumption cost. It is therefore essential to minimize the amount of intermediate data in the shuffle phase rather than to supply more network bandwidth, which would increase service cost. With this objective, we extended the multi-level per node combiner to a batch of MapReduce jobs to improve makespan. We observed that makespan improved by up to 32.4% when the amount of intermediate data in the shuffle phase was minimized, compared with classical schedulers using default combiners.
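To make the effect of combining on shuffle volume concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It simulates a word-count job and counts the intermediate records that would be shuffled with no combiner, with a combiner applied per map task, and with a combiner applied once per node across all of that node's map tasks. The node layout, the input splits, and the names `map_task`, `combine`, and `shuffle_records` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def map_task(split):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for word in split.split()]


def combine(pairs):
    """Sum the values per key, as a word-count combiner would."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical layout: two nodes, each running two map tasks.
NODES = {
    "node1": ["a b a c", "a b b"],
    "node2": ["c c a", "b a a"],
}


def shuffle_records(level):
    """Count the records sent to the shuffle phase for a combining level."""
    count = 0
    for splits in NODES.values():
        outputs = [map_task(split) for split in splits]
        if level == "none":
            # Every raw map output record is shuffled.
            count += sum(len(output) for output in outputs)
        elif level == "per_task":
            # Each map task combines only its own output.
            count += sum(len(combine(output)) for output in outputs)
        elif level == "per_node":
            # Outputs of all map tasks on a node are combined together.
            merged = [pair for output in outputs for pair in output]
            count += len(combine(merged))
        else:
            raise ValueError(f"unknown combining level: {level}")
    return count


if __name__ == "__main__":
    for level in ("none", "per_task", "per_node"):
        print(level, shuffle_records(level))
```

On this toy input the record count falls from 13 with no combiner to 9 with per-task combining and 6 with per-node combining. Real savings depend on key distribution and on how map tasks are placed across nodes.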
Acknowledgements
This study was supported by a National Research Foundation of Korea (NRF) Grant funded by the Korean government (NRF-2017R1C1B5017464).
About this article
Cite this article
Jeyaraj, R., Ananthanarayana, V.S. & Paul, A. Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment. J Ambient Intell Human Comput 11, 4261–4272 (2020). https://doi.org/10.1007/s12652-020-01707-7
DOI: https://doi.org/10.1007/s12652-020-01707-7