MapReduce Data Skewness Handling: A Systematic Literature Review

Irandoost, Mohammad Amin; Rahmani, Amir Masoud; Setayeshi, Saeed

doi:10.1007/s10766-019-00627-0

Published: 23 January 2019

Volume 47, pages 907–950, (2019)

International Journal of Parallel Programming

Mohammad Amin Irandoost¹,
Amir Masoud Rahmani¹ &
Saeed Setayeshi²

Abstract

MapReduce programming is one of the most successful techniques for large-scale, data-intensive computation. It follows a divide-and-conquer approach that uses commodity computers, known as nodes, for parallel processing. Its scalability and performance depend largely on how data is distributed across map and reduce tasks. For many reasons, such as node failure, network failure, and data skewness, one task may take longer to complete than the others, and because job completion time is determined by the slowest task, a single slow task delays the whole job. Data skewness is one of the most important reasons that one task takes longer than the others. Although MapReduce is widely used because of its high flexibility and fault tolerance, in the presence of data skewness it cannot fully utilize the nodes for parallel processing. The objectives of this study were to review related articles, classify them by the type of problem addressed, and determine their advantages and disadvantages. Open issues were also defined to provide guidelines for future research on this subject. To achieve these objectives, a set of research questions was defined and answered. The review concludes that important parameters have not been considered in existing MapReduce data skewness handling approaches.
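The effect described in the abstract, where a skewed key distribution makes one reduce task the straggler that sets the job completion time, can be illustrated with a small sketch. This example is not taken from the article. It assumes integer keys, a simple modulo partitioner standing in for a hash partitioner, and a uniform cost of one time unit per record. The names `partition_loads` and `job_time` are hypothetical.

```python
def partition_loads(keys, num_reducers):
    """Count how many records a modulo partitioner sends to each reducer.

    Assumption: integer keys and `key % num_reducers` as the partition
    function, a stand-in for a generic hash partitioner.
    """
    loads = [0] * num_reducers
    for key in keys:
        loads[key % num_reducers] += 1
    return loads


def job_time(loads, cost_per_record=1.0):
    """The reduce phase finishes only when the most loaded reducer finishes.

    Assumption: every record costs the same to process.
    """
    return max(loads) * cost_per_record


# Uniform input: 1000 records, each key appears exactly once.
uniform_keys = list(range(1000))

# Skewed input: 1000 records, but key 0 accounts for 700 of them.
skewed_keys = [0] * 700 + list(range(1, 301))

uniform_loads = partition_loads(uniform_keys, 4)
skewed_loads = partition_loads(skewed_keys, 4)

print("uniform loads:", uniform_loads, "job time:", job_time(uniform_loads))
print("skewed loads: ", skewed_loads, "job time:", job_time(skewed_loads))
```

Both inputs contain the same number of records, yet under these assumptions the skewed input leaves three reducers mostly idle while the fourth processes the bulk of the data, so the nodes are not fully utilized.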

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Mohammad Amin Irandoost & Amir Masoud Rahmani
Department of Medical Radiation Engineering, Amirkabir University of Technology, Tehran, Iran
Saeed Setayeshi

Corresponding author

Correspondence to Amir Masoud Rahmani.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Irandoost, M.A., Rahmani, A.M. & Setayeshi, S. MapReduce Data Skewness Handling: A Systematic Literature Review. Int J Parallel Prog 47, 907–950 (2019). https://doi.org/10.1007/s10766-019-00627-0

Received: 06 September 2018
Accepted: 16 January 2019
Published: 23 January 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s10766-019-00627-0
