MapReduce: Review and open challenges

Hashem, Ibrahim Abaker Targio; Anuar, Nor Badrul; Gani, Abdullah; Yaqoob, Ibrar; Xia, Feng; Khan, Samee Ullah

doi:10.1007/s11192-016-1945-y

Published: 15 April 2016

Scientometrics, Volume 109, pages 389–422 (2016)

Abstract

The continuous increase in computational capacity over the past years has produced an overwhelming flow of data, or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. The MapReduce computational paradigm is a major enabler underlying numerous big data platforms. MapReduce is a popular tool for the distributed and scalable processing of big data, and it is increasingly used in different applications primarily because of its important features, including scalability, fault tolerance, ease of programming, and flexibility. Thus, a bibliometric analysis and review were conducted to assess the trend of MapReduce research publications indexed in Scopus from 2006 to 2015. This trend includes the use of the MapReduce framework for big data processing and its development. The study analyzed the distribution of published articles, countries, authors, keywords, and authorship patterns. For data visualization, the VOSviewer program was used to produce distance- and graph-based maps. The top 10 most cited articles were also identified based on citation counts. The study used productivity measures, domain visualization techniques, and co-word analysis to explore papers related to MapReduce in the field of big data. Moreover, the study discussed the most influential articles that contributed to improvements in MapReduce and reviewed the corresponding solutions. Finally, it presented several open challenges in big data processing with MapReduce as future research directions.
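
As a rough illustration of the two ideas the abstract combines, the sketch below expresses a co-word (keyword co-occurrence) count in the map, shuffle, and reduce style. It is a minimal single-process sketch, not the method used in the study: the function names and the keyword lists in `papers` are hypothetical and invented for illustration, and a real MapReduce job would distribute each phase across a cluster.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(record):
    """Emit ((keyword_a, keyword_b), 1) for each unordered keyword pair in one record."""
    keywords = sorted(set(record))  # sort so each pair has one canonical order
    for pair in combinations(keywords, 2):
        yield pair, 1


def shuffle(mapped):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one keyword pair."""
    return key, sum(values)


def co_word_counts(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list of records."""
    mapped = (kv for record in records for kv in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical keyword lists, invented for illustration only.
papers = [
    ["mapreduce", "big data", "hadoop"],
    ["mapreduce", "hadoop"],
    ["big data", "mapreduce"],
]
counts = co_word_counts(papers)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The map step is independent per record and the reduce step is independent per key, which is the property that lets such a computation scale out and tolerate the failure of individual workers.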

Acknowledgments

This paper is financially supported by the Malaysian Ministry of Education under the University of Malaya High Impact Research Grant UM.C/625/1/HIR/MoE/FCSIT/03.

Author information

Authors and Affiliations

Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, 50603, Malaysia
Ibrahim Abaker Targio Hashem, Nor Badrul Anuar, Abdullah Gani & Ibrar Yaqoob
School of Software, Dalian University of Technology, Dalian, 116620, China
Feng Xia
NDSU-CIIT Green Computing and Communications, North Dakota State University, Fargo, ND, 58108, USA
Samee Ullah Khan

Corresponding authors

Correspondence to Ibrahim Abaker Targio Hashem or Nor Badrul Anuar.

About this article

Cite this article

Hashem, I.A.T., Anuar, N.B., Gani, A. et al. MapReduce: Review and open challenges. Scientometrics 109, 389–422 (2016). https://doi.org/10.1007/s11192-016-1945-y

Received: 01 February 2016
Published: 15 April 2016
Issue Date: October 2016
DOI: https://doi.org/10.1007/s11192-016-1945-y
