Abstract
The continuous increase in computational capacity over the past years has produced an overwhelming flow of data, or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. The MapReduce computational paradigm is a major enabler underlying numerous big data platforms. MapReduce is a popular tool for the distributed and scalable processing of big data. It is increasingly being used in different applications, primarily because of its important features, including scalability, fault tolerance, ease of programming, and flexibility. Thus, a bibliometric analysis and review was conducted to evaluate the trend of MapReduce research assessment publications indexed in Scopus from 2006 to 2015. This trend includes the use of the MapReduce framework for big data processing and its development. The study analyzed the distribution of published articles, countries, authors, keywords, and authorship patterns. For data visualization, the VOSviewer program was used to produce distance- and graph-based maps. The top 10 most cited articles were also identified based on the citation count of publications. The study utilized productivity measures, domain visualization techniques, and co-word analysis to explore papers related to MapReduce in the field of big data. Moreover, the study discussed the most influential articles that contributed to improvements in MapReduce and reviewed the corresponding solutions. Finally, it presented several open challenges in big data processing with MapReduce as future research directions.

Acknowledgments
This paper is financially supported by the Malaysian Ministry of Education under the University of Malaya High Impact Research Grant UM.C/625/1/HIR/MoE/FCSIT/03.
Cite this article
Hashem, I.A.T., Anuar, N.B., Gani, A. et al. MapReduce: Review and open challenges. Scientometrics 109, 389–422 (2016). https://doi.org/10.1007/s11192-016-1945-y
DOI: https://doi.org/10.1007/s11192-016-1945-y