Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment

Jeyaraj, Rathinaraja; Ananthanarayana, V. S.; Paul, Anand

doi:10.1007/s12652-020-01707-7

Original Research
Published: 22 January 2020

Volume 11, pages 4261–4272, (2020)

Journal of Ambient Intelligence and Humanized Computing


Abstract

Big data has overwhelmed industry and research sectors. Reliable decision making is always a challenging task, and it requires cost-effective big data processing tools. Hadoop MapReduce is used to store and process huge volumes of data in a distributed environment. However, because of the large capital investment and the lack of expertise needed to set up an on-premise Hadoop cluster, big data users seek cloud-based MapReduce services over the Internet. Most commonly, MapReduce on a cluster of virtual machines (VMs) is offered as a service on a pay-per-use basis. The VMs in a MapReduce virtual cluster reside on different physical machines and are co-located with non-MapReduce VMs. As a result, they share I/O resources such as disk and network bandwidth, which leads to congestion because most MapReduce jobs are disk and network intensive. In particular, the shuffle phase of the MapReduce execution sequence consumes a large amount of network bandwidth in a multi-tenant environment, which increases job latency and bandwidth consumption cost. Therefore, it is essential to minimize the amount of intermediate data in the shuffle phase rather than to supply more network bandwidth, which increases service cost. With this objective, we extended the multi-level per node combiner to a batch of MapReduce jobs to improve makespan. We observed that makespan improved by up to 32.4% when the amount of intermediate data in the shuffle phase was minimized, compared with classical schedulers using default combiners.
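
To make the idea of reducing intermediate data before the shuffle concrete, the following is a minimal Python sketch. It is not the authors' multi-level per node combiner. It is a toy word-count model, under the assumption that combining can be applied either per map task or once per node, and it only counts how many key-value records would cross the network in each case. The node names, input splits, and function names are hypothetical.

```python
from collections import Counter

def map_task(split):
    """Emit one (word, 1) pair per word, as a classic word-count mapper does."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Sum values per key locally, so each key is emitted only once."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def shuffle_sizes(node_splits):
    """Count records sent to the shuffle under three combining strategies.

    node_splits maps a hypothetical node name to the input splits of the
    map tasks placed on that node.
    Returns (no_combiner, per_task_combiner, per_node_combiner).
    """
    none = per_task = per_node = 0
    for splits in node_splits.values():
        outputs = [map_task(s) for s in splits]
        # No combiner: every emitted pair is shuffled.
        none += sum(len(o) for o in outputs)
        # Per-task combiner: each map task merges its own output.
        per_task += sum(len(combine(o)) for o in outputs)
        # Per-node combiner: all map outputs on the node are merged together.
        merged = [pair for o in outputs for pair in o]
        per_node += len(combine(merged))
    return none, per_task, per_node

# Hypothetical placement of four map tasks on two nodes.
cluster = {
    "node-a": ["big data big cluster", "big data job"],
    "node-b": ["data job job", "cluster job data"],
}
print(shuffle_sizes(cluster))  # (13, 11, 7)
```

In this toy input, combining per node leaves 7 records to shuffle instead of 13, while the summed counts per word are unchanged. The sketch illustrates only the record-count effect, not scheduling or makespan.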



Acknowledgements

This study was supported by a National Research Foundation of Korea (NRF) Grant funded by the Korean government (NRF-2017R1C1B5017464).

Author information

Authors and Affiliations

Department of IT, National Institute of Technology Karnataka, Mangalore, India
Rathinaraja Jeyaraj & V. S. Ananthanarayana
School of Computer Science and Engineering, Kyungpook National University, 80-Daehakro, Daegu, South Korea
Anand Paul

Corresponding author

Correspondence to Anand Paul.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DTX 112 kb)

Supplementary material 2 (DTX 190 kb)

About this article

Cite this article

Jeyaraj, R., Ananthanarayana, V.S. & Paul, A. Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment. J Ambient Intell Human Comput 11, 4261–4272 (2020). https://doi.org/10.1007/s12652-020-01707-7

Received: 15 April 2019
Accepted: 07 January 2020
Published: 22 January 2020
Issue Date: October 2020
DOI: https://doi.org/10.1007/s12652-020-01707-7
