Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment

  • Original Research
  • Journal of Ambient Intelligence and Humanized Computing

Abstract

Big data has overwhelmed industry and research sectors. Reliable decision making remains a challenging task that requires cost-effective big data processing tools. Hadoop MapReduce is widely used to store and process huge volumes of data in a distributed environment. However, because of the large capital investment and the expertise required to set up an on-premise Hadoop cluster, big data users seek cloud-based MapReduce services over the Internet. Typically, MapReduce on a cluster of virtual machines (VMs) is offered as a service on a pay-per-use basis. The VMs in a MapReduce virtual cluster reside on different physical machines and are co-located with other non-MapReduce VMs. As a result, they share I/O resources such as disk and network bandwidth, which leads to congestion because most MapReduce jobs are disk and network intensive. In particular, the shuffle phase of the MapReduce execution sequence consumes a large share of network bandwidth in a multi-tenant environment. This increases job latency and bandwidth consumption cost. Therefore, it is essential to minimize the amount of intermediate data in the shuffle phase rather than supply more network bandwidth, which would increase service cost. With this objective, we extended the multi-level per node combiner to a batch of MapReduce jobs to improve makespan. We observed that makespan improved by up to 32.4% when the amount of intermediate data in the shuffle phase was minimized, compared with classical schedulers using default combiners.
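
To illustrate why combining at node level can shrink shuffle traffic, the following is a minimal Python sketch, not the authors' implementation. It assumes a word-count job, and the function names (`map_task`, `combine`, `shuffle_records`) and the sample splits are hypothetical. It compares the number of records a single node would send to the shuffle phase with no combiner, with one combiner per map task, and with one combiner over all map tasks on the node.

```python
from collections import Counter


def map_task(split):
    """Emit (word, 1) pairs for one input split, as a word-count mapper would."""
    return [(word, 1) for word in split.split()]


def combine(pairs):
    """Sum the values per key; this is the usual combiner for word count."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def shuffle_records(node_splits, mode):
    """Count the records leaving one node under a given combining mode.

    mode is 'none' (raw map output), 'per_task' (one combiner per map task),
    or 'per_node' (one combiner over the output of all map tasks on the node).
    """
    outputs = [map_task(split) for split in node_splits]
    if mode == "none":
        return sum(len(out) for out in outputs)
    if mode == "per_task":
        return sum(len(combine(out)) for out in outputs)
    if mode == "per_node":
        merged = [pair for out in outputs for pair in out]
        return len(combine(merged))
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical input: three splits processed by map tasks on the same node.
splits = ["a b a c", "a b b d", "c a a b"]

for mode in ("none", "per_task", "per_node"):
    print(mode, shuffle_records(splits, mode))
# none 12
# per_task 9
# per_node 4
```

In this toy input the same keys recur across map tasks on the node, so merging their outputs before the shuffle removes duplicates that a per-task combiner cannot see. The actual savings depend on key distribution and on how map tasks are placed.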

Acknowledgements

This study was supported by a National Research Foundation of Korea (NRF) Grant funded by the Korean government (NRF-2017R1C1B5017464).

Author information

Corresponding author

Correspondence to Anand Paul.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DTX 112 kb)

Supplementary material 2 (DTX 190 kb)

About this article

Cite this article

Jeyaraj, R., Ananthanarayana, V.S. & Paul, A. Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment. J Ambient Intell Human Comput 11, 4261–4272 (2020). https://doi.org/10.1007/s12652-020-01707-7

  • DOI: https://doi.org/10.1007/s12652-020-01707-7
