Abstract
MapReduce is one of the most successful techniques for large-scale data processing. However, its performance degrades significantly when the data are skewed. Hash partitioning is the default in Big Data frameworks such as Hadoop and Spark. It works well when there is no data skewness, which is rarely the case for real-world data. In this paper, we propose two new learning automata-based algorithms, the learning automata partitioner (LAP) and the traffic cost-aware partitioner (TCAP), for handling reducer-side data skewness in MapReduce applications. LAP is based on cluster combination and performs well when the degree of data skewness is low. TCAP, on the other hand, takes the network topology into account and balances network traffic cost in the shuffling phase. TCAP also supports cluster splitting and performs well at any degree of data skewness. Both LAP and TCAP can be used in heterogeneous environments. We evaluated the algorithms through several experiments and simulations using well-known benchmarks. The results confirmed that our algorithms outperformed other similar algorithms in most cases.
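To make the skew problem concrete, the following is a minimal Python sketch of hash partitioning, not the LAP or TCAP algorithms themselves. It reproduces the formula used by Hadoop's `HashPartitioner`, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, with a Python re-implementation of Java's `String.hashCode()`. The helper names (`java_string_hash`, `reducer_loads`, `imbalance`) and the synthetic key counts are assumptions made for illustration only.

```python
def java_string_hash(s: str) -> int:
    """Signed 32-bit hash following Java's String.hashCode() (BMP characters)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert the unsigned 32-bit value to Java's signed int range.
    return h - (1 << 32) if h >= (1 << 31) else h


def hash_partition(key: str, num_reducers: int) -> int:
    """Reducer index, using the same formula as Hadoop's HashPartitioner."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


def reducer_loads(key_counts: dict, num_reducers: int) -> list:
    """Total number of records each reducer receives under hash partitioning."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[hash_partition(key, num_reducers)] += count
    return loads


def imbalance(loads: list) -> float:
    """Ratio of the heaviest reducer load to the mean load (1.0 is ideal)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    # Hypothetical workloads: 100 keys and 1000 records in both cases.
    uniform = {f"k{i}": 10 for i in range(100)}
    skewed = {f"k{i}": 1 for i in range(99)}
    skewed["hot"] = 901  # a single key holds most of the records

    for name, data in (("uniform", uniform), ("skewed", skewed)):
        loads = reducer_loads(data, num_reducers=4)
        print(name, loads, round(imbalance(loads), 2))
```

Because every record with the same key must go to the same reducer, the hash function cannot spread the hot key across reducers. The reducer that receives it carries most of the work no matter how evenly the remaining keys are distributed, which is the situation that skew-aware partitioners are designed to address.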
About this article
Cite this article
Irandoost, M.A., Rahmani, A.M. & Setayeshi, S. Learning automata-based algorithms for MapReduce data skewness handling. J Supercomput 75, 6488–6516 (2019). https://doi.org/10.1007/s11227-019-02855-0
DOI: https://doi.org/10.1007/s11227-019-02855-0