Learning automata-based algorithms for MapReduce data skewness handling

The Journal of Supercomputing

Abstract

MapReduce is one of the most successful techniques for large-scale data processing. However, its performance drops significantly when the data are skewed. Hash partitioning is the default partitioner in big data frameworks such as Hadoop and Spark. It balances load well when there is no data skewness, which is rarely the case for real-world data. In this paper, we propose two new algorithms based on learning automata, namely the learning automata partitioner (LAP) and the traffic cost-aware partitioner (TCAP), for handling reducer-side data skewness in MapReduce applications. LAP is based on cluster combination and performs well when the degree of data skewness is low. TCAP, on the other hand, takes the network topology into account and balances network traffic cost in the shuffling phase. TCAP supports cluster splitting and performs well at any degree of data skewness. Both LAP and TCAP can also be used in heterogeneous environments. We evaluated the performance of our algorithms through several experiments and simulations using well-known benchmarks. The results confirm that our algorithms perform better than other similar algorithms in most cases.
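To make the reducer-side skew problem concrete, the following minimal sketch shows how hash partitioning can leave one reducer with most of the work when a single key dominates. It is an illustration only and is not the LAP or TCAP algorithm. The function names, the two synthetic workloads, and the use of `zlib.crc32` as a stand-in for a framework's own hash function are all assumptions made for this example.

```python
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index by hashing.

    zlib.crc32 is an assumed stand-in for the framework's hash function;
    it is used here because it is deterministic across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(key_counts: dict, num_reducers: int) -> list:
    """Sum the number of records each reducer receives."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[hash_partition(key, num_reducers)] += count
    return loads


def imbalance(loads: list) -> float:
    """Ratio of the heaviest reducer load to the mean load (1.0 is ideal)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


NUM_REDUCERS = 4

# Hypothetical workload without skew: 1000 keys, 100 records each.
uniform = {f"key{i}": 100 for i in range(1000)}

# Hypothetical skewed workload: the same keys, but one hot key
# carries 1,000,000 records.
skewed = dict(uniform)
skewed["key0"] = 1_000_000

uniform_loads = reducer_loads(uniform, NUM_REDUCERS)
skewed_loads = reducer_loads(skewed, NUM_REDUCERS)

print("uniform loads:", uniform_loads, "imbalance:", imbalance(uniform_loads))
print("skewed loads: ", skewed_loads, "imbalance:", imbalance(skewed_loads))
```

Because every record of a key must go to the same reducer, the reducer that receives the hot key holds at least 1,000,000 of the 1,099,900 records no matter how well the hash spreads the remaining keys. This is the kind of imbalance that a skew-aware partitioner aims to reduce.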

Author information

Corresponding author

Correspondence to Amir Masoud Rahmani.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Irandoost, M.A., Rahmani, A.M. & Setayeshi, S. Learning automata-based algorithms for MapReduce data skewness handling. J Supercomput 75, 6488–6516 (2019). https://doi.org/10.1007/s11227-019-02855-0

  • DOI: https://doi.org/10.1007/s11227-019-02855-0
