MapReduce: an infrastructure review and research insights

Maleki, Neda; Rahmani, Amir Masoud; Conti, Mauro

doi:10.1007/s11227-019-02907-5

Published: 08 June 2019

Volume 75, pages 6934–7002, (2019)

The Journal of Supercomputing

Neda Maleki¹,
Amir Masoud Rahmani¹ &
Mauro Conti²

Abstract

In the current decade, searching massive data sets for "hidden" and valuable information has become increasingly common. Such searches can require heavy processing over large volumes of data, which has led to the development of solutions based on distributed and parallel processing. Among parallel programming models, MapReduce has gained considerable popularity. The goal of this paper is to survey the research conducted on the MapReduce framework in the context of its open-source implementation, Hadoop, in order to summarize and report on this wide topic area at the infrastructure level. We conducted a systematic review organized around the prevalent MapReduce topics in seven areas: (1) performance; (2) job/task scheduling; (3) load balancing; (4) resource provisioning; (5) fault tolerance in terms of availability and reliability; (6) security; and (7) energy efficiency. Our study is a quantitative and qualitative evaluation of the trends in research publications that appeared between January 1, 2014, and November 1, 2017. Since MapReduce is a challenge-prone area for researchers who set out to work with and extend it, this work is a useful guideline for getting feedback and starting research.

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Neda Maleki & Amir Masoud Rahmani
Department of Mathematics, University of Padua, Padua, Italy
Mauro Conti

Corresponding author

Correspondence to Amir Masoud Rahmani.

About this article

Cite this article

Maleki, N., Rahmani, A.M. & Conti, M. MapReduce: an infrastructure review and research insights. J Supercomput 75, 6934–7002 (2019). https://doi.org/10.1007/s11227-019-02907-5

Issue Date: October 2019
