Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach

Gandomi, Abolfazl; Movaghar, Ali; Reshadi, Midia; Khademzadeh, Ahmad

doi:10.1007/s11227-020-03162-9

Published: 16 January 2020

The Journal of Supercomputing, volume 76, pages 7177–7203 (2020)

Abolfazl Gandomi¹,
Ali Movaghar ORCID: orcid.org/0000-0002-6803-6750²,
Midia Reshadi¹ &
Ahmad Khademzadeh³

Abstract

The MapReduce framework is an effective method for parallel processing of big data. Improving the performance of MapReduce clusters and reducing job execution time is a fundamental challenge for this approach. Two problems arise in particular: how to maximize the execution overlap between jobs and how to produce an optimal job schedule. Both depend on a precise model for estimating job execution time, which is difficult to build because of the large number and volume of submitted jobs, the limited resources available, and the need for proper Hadoop configuration. This paper presents a model, based on the MapReduce phases, for predicting the execution time of jobs in a heterogeneous cluster. It also proposes a novel heuristic method that significantly reduces the makespan of the jobs. The method first uses a job profiling tool to obtain the execution details of the MapReduce phases through log analysis. Then, using machine learning methods and statistical analysis, it builds a model to predict runtime. Finally, a second tool, the job submission and monitoring tool, is used to calculate the makespan. Experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the proposed method achieved a higher average makespan speedup than the unoptimized case.
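
The pipeline described in the abstract (profile the phases, fit a predictive model, compare makespans) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the phase names, the made-up profile data, the choice of a per-phase linear least-squares fit, the simplification that a job's time is the plain sum of its phase times (ignoring overlap between phases), and the definition of makespan speedup as the ratio of unoptimized to optimized makespan.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_bar - a * x_bar


def fit_phase_models(profiles):
    """Fit one linear model per phase.

    profiles maps a phase name to a list of (input_mb, seconds) samples,
    standing in for what a profiling tool might extract from job logs.
    """
    return {
        phase: fit_line([s[0] for s in samples], [s[1] for s in samples])
        for phase, samples in profiles.items()
    }


def predict_job_time(models, input_mb):
    """Predicted job time as the sum of predicted phase times (assumes no overlap)."""
    return sum(a * input_mb + b for a, b in models.values())


def makespan_speedup(unoptimized, optimized):
    """Assumed definition: unoptimized makespan divided by optimized makespan."""
    return unoptimized / optimized


# Hypothetical profile data: (input size in MB, phase duration in seconds).
PROFILES = {
    "map": [(100, 12.0), (200, 22.0), (400, 42.0)],
    "shuffle": [(100, 5.0), (200, 9.0), (400, 17.0)],
    "reduce": [(100, 8.0), (200, 14.0), (400, 26.0)],
}

if __name__ == "__main__":
    models = fit_phase_models(PROFILES)
    print(predict_job_time(models, 300))
    print(makespan_speedup(120.0, 80.0))
```

A real model for a heterogeneous cluster would likely need per-node features and a richer learner; the sketch only shows how phase-level profiles can feed a runtime estimate.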

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Abolfazl Gandomi & Midia Reshadi
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Ali Movaghar
Iran Telecommunication Research Center, ITRC, Tehran, Iran
Ahmad Khademzadeh

Corresponding author

Correspondence to Ali Movaghar.

About this article

Cite this article

Gandomi, A., Movaghar, A., Reshadi, M. et al. Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach. J Supercomput 76, 7177–7203 (2020). https://doi.org/10.1007/s11227-020-03162-9

Published: 16 January 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s11227-020-03162-9