Learning automata-based algorithms for MapReduce data skewness handling

Irandoost, Mohammad Amin; Rahmani, Amir Masoud; Setayeshi, Saeed

doi:10.1007/s11227-019-02855-0

Published: 22 April 2019

Volume 75, pages 6488–6516, (2019)

The Journal of Supercomputing

Mohammad Amin Irandoost¹,
Amir Masoud Rahmani¹ &
Saeed Setayeshi²

Abstract

MapReduce is one of the most successful techniques for large-scale data processing. However, its performance drops significantly when the data are skewed. Hash partitioning is the default in Big Data frameworks such as Hadoop and Spark. It works well when there is no data skewness, which is not the case for data arising from natural events. In this paper, we propose two new algorithms based on learning automata, namely the learning automata partitioner (LAP) and the traffic cost-aware partitioner (TCAP), for handling reducer-side data skewness in MapReduce applications. LAP is based on cluster combination and performs well when the degree of data skewness is low. TCAP, on the other hand, considers network topology and balances network traffic cost in the shuffling phase. TCAP supports cluster splitting and performs well at any degree of data skewness. Both LAP and TCAP can also be used in heterogeneous environments. We evaluated the algorithms through several experiments and simulations using well-known benchmarks. The results confirm that our algorithms perform better than other similar algorithms in most cases.
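To illustrate the problem the abstract describes, the following minimal Python sketch shows how hash partitioning overloads one reducer when a single key dominates the input. It is not the authors' LAP or TCAP algorithm. The function names, the CRC32 stand-in for a key hash, the workload, and the reducer count of 4 are all assumptions chosen for illustration.

```python
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index.

    CRC32 is an assumed stand-in for a framework's key hash; it is used
    here only because it is stable across Python runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers: int):
    """Count how many intermediate records each reducer would receive."""
    loads = [0] * num_reducers
    for key in keys:
        loads[hash_partition(key, num_reducers)] += 1
    return loads


# Hypothetical workloads: 1000 distinct keys versus one "hot" key
# that accounts for 900 of the 1000 records.
uniform = [f"key{i}" for i in range(1000)]
skewed = ["hot"] * 900 + [f"key{i}" for i in range(100)]

uniform_loads = reducer_loads(uniform, 4)
skewed_loads = reducer_loads(skewed, 4)

print("uniform:", uniform_loads)
print("skewed: ", skewed_loads)
```

Because every record with the same key hashes to the same reducer, the reducer that receives the hot key gets at least 900 of the 1000 records in the skewed case, however many reducers are available. This is the reducer-side imbalance that skew-aware partitioners aim to reduce.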

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Mohammad Amin Irandoost & Amir Masoud Rahmani
Department of Medical Radiation Engineering, Amirkabir University of Technology, Tehran, Iran
Saeed Setayeshi

Corresponding author

Correspondence to Amir Masoud Rahmani.

About this article

Cite this article

Irandoost, M.A., Rahmani, A.M. & Setayeshi, S. Learning automata-based algorithms for MapReduce data skewness handling. J Supercomput 75, 6488–6516 (2019). https://doi.org/10.1007/s11227-019-02855-0

Issue Date: October 2019
DOI: https://doi.org/10.1007/s11227-019-02855-0
