Reducing partition skew on MapReduce: an incremental allocation approach

Wang, Zhuo; Chen, Qun; Suo, Bo; Pan, Wei; Li, Zhanhuai

doi:10.1007/s11704-018-6586-2

Research Article
Published: 17 June 2019

Frontiers of Computer Science, Volume 13, pages 960–975 (2019)

Abstract

MapReduce, a parallel computational model, has been widely used to process big data in distributed clusters. Because it consists of alternating map and reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge in ensuring a balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on the mapped data.

In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers statistics on their sizes during mapping. The micro-partitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps: micro-partition scheduling and micro-partition allocation. We propose a Markov decision process (MDP) model to optimize multiple-round micro-partition scheduling for allocation commitment. We present an optimal solution with time complexity O(K · N²), where K is the number of allocation rounds and N is the number of micro-partitions. As an alternative, we present a greedy but more efficient algorithm with time complexity O(K · N ln N). We then propose a min-max programming model for the allocation mapping between micro-partitions and reducers and, because the problem is NP-complete, present an effective heuristic solution. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that, compared with state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as better overall parallel performance.
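The allocation step described above can be illustrated with a minimal sketch. It is not the heuristic from the paper, whose details are not given in this abstract. It assumes instead the classic longest-processing-time-first greedy rule for min-max load balancing: within each round, the micro-partitions committed in that round are sorted by observed size and each is assigned to the currently least-loaded reducer. The function names, the fixed per-round batches, and the example sizes are all hypothetical; in the paper, the MDP-based scheduling step decides which micro-partitions are committed in each round.

```python
import heapq


def allocate_round(sizes, loads):
    """Assign one round of micro-partitions to reducers (illustrative only).

    sizes: dict mapping micro-partition id -> observed size
    loads: list of current reducer loads, updated in place
    Returns a dict mapping micro-partition id -> reducer index.
    """
    # Min-heap of (load, reducer index) so the least-loaded reducer pops first.
    heap = [(load, r) for r, load in enumerate(loads)]
    heapq.heapify(heap)
    assignment = {}
    # Largest micro-partitions first; ties broken by id for determinism.
    for pid, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, r = heapq.heappop(heap)
        assignment[pid] = r
        loads[r] = load + size
        heapq.heappush(heap, (loads[r], r))
    return assignment


def incremental_allocate(rounds, num_reducers):
    """Commit micro-partitions round by round, carrying reducer loads forward."""
    loads = [0] * num_reducers
    assignment = {}
    for sizes in rounds:
        assignment.update(allocate_round(sizes, loads))
    return assignment, loads


# Hypothetical sizes: two rounds of three micro-partitions, two reducers.
rounds = [{"p0": 9, "p1": 7, "p2": 6}, {"p3": 5, "p4": 4, "p5": 1}]
assignment, loads = incremental_allocate(rounds, num_reducers=2)
print(assignment)
print(loads)  # [15, 17]
```

Because earlier rounds are committed before later sizes are known, the final loads (15 and 17 here) can differ from what a single allocation over all six micro-partitions would produce. That gap is what the scheduling step, which chooses when to commit each micro-partition, is meant to reduce.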

Acknowledgements

This work was supported by the Ministry of Science and Technology of China, National Key Research and Development Program (2016YFB1000703), the National Natural Science Foundation of China (Grant Nos. 61732014, 61332006, 61672432, 61472321, 61502390), the Natural Science Basic Research Plan in Shaanxi Province of China (2018JM6086) and the Fundamental Research Funds for the Central Universities (3102017jg02002).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, 710072, China
Zhuo Wang, Qun Chen, Bo Suo, Wei Pan & Zhanhuai Li

Corresponding author

Correspondence to Qun Chen.

Additional information

Zhuo Wang is a PhD candidate at the School of Computer Science and Technology, Northwestern Polytechnical University, China. He received his master's degree from Tianjin University of Technology in 2011. His research interests include big data analysis and data management.

Qun Chen is a professor at the School of Computer Science and Technology, Northwestern Polytechnical University, China. He is a member of the China Computer Federation. He received his PhD degree from NUS. His research interests include data management and big data analysis.

Bo Suo is a PhD candidate at the School of Computer Science and Technology, Northwestern Polytechnical University, China. He received his master's degree from NWPU in 2010. He is a student member of the China Computer Federation. His current research interests include big data analysis and big graphs.

Wei Pan is an associate professor at the School of Computer Science and Technology, Northwestern Polytechnical University, China. He is a member of the China Computer Federation. His current research interests include big data analysis and in-memory databases.

Zhanhuai Li is a professor at the School of Computer Science and Technology, Northwestern Polytechnical University, China. His research interests include data management and data mining.
