MapReduce: Review and open challenges

Scientometrics

Abstract

The continuous increase in computational capacity over the past years has produced an overwhelming flow of data, or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. The MapReduce computational paradigm is a major enabler underlying numerous big data platforms and a popular tool for the distributed and scalable processing of big data. It is increasingly being used in different applications, primarily because of its important features, including scalability, fault tolerance, ease of programming, and flexibility. Thus, a bibliometric analysis and review were conducted to evaluate the trend of MapReduce research publications indexed in Scopus from 2006 to 2015. This trend includes the use of the MapReduce framework for big data processing and its development. The study analyzed the distribution of published articles, countries, authors, keywords, and authorship patterns. For data visualization, the VOSviewer program was used to produce distance- and graph-based maps. The top 10 most cited articles were also identified based on the citation count of publications. The study utilized productivity measures, domain visualization techniques, and co-word analysis to explore papers related to MapReduce in the field of big data. Moreover, the study discussed the most influential articles that contributed to improvements in MapReduce and reviewed the corresponding solutions. Finally, it presented several open challenges in big data processing with MapReduce as future research directions.

Acknowledgments

This paper is financially supported by the Malaysian Ministry of Education under the University of Malaya High Impact Research Grant UM.C/625/1/HIR/MoE/FCSIT/03.

Author information

Corresponding authors

Correspondence to Ibrahim Abaker Targio Hashem or Nor Badrul Anuar.

About this article

Cite this article

Hashem, I.A.T., Anuar, N.B., Gani, A. et al. MapReduce: Review and open challenges. Scientometrics 109, 389–422 (2016). https://doi.org/10.1007/s11192-016-1945-y

  • DOI: https://doi.org/10.1007/s11192-016-1945-y
