Load balancing in join algorithms for skewed data in MapReduce systems

Gavagsaz, Elaheh; Rezaee, Ali; Haj Seyyed Javadi, Hamid

doi:10.1007/s11227-018-2578-0

Published: 01 September 2018

Volume 75, pages 228–254, (2019)

The Journal of Supercomputing

Abstract

Join is an essential tool for analyzing data collected from different data sources. MapReduce has emerged as a prominent programming model for processing massive data. However, traditional join algorithms based on MapReduce are not efficient when handling skewed data. Skew in the input data leads to considerable load imbalance and performance degradation. This paper proposes a new skew-insensitive method, called fine-grained partitioning for skew data (FGSD), which improves load balancing across reduce tasks. The proposed method considers the properties of both input and output data through a proposed stream sampling algorithm. FGSD introduces a new approach to distributing input data that efficiently handles both redistribution skew and join product skew. The experimental results confirm that our solution not only achieves better load balancing but also reduces the execution time of a job under varying degrees of data skew. Furthermore, FGSD does not require any modification to the MapReduce environment and is applicable to complex joins.
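
The abstract does not spell out how FGSD partitions data, so the following is only a minimal sketch of the general idea it describes: using sampled join keys from both inputs to estimate reducer load, then assigning keys so that reducers are balanced. The function names, the cost model (input tuples plus estimated join output tuples per key) and the greedy heaviest-first assignment are all assumptions made for illustration, not the authors' algorithm.

```python
import heapq
from collections import Counter


def estimate_key_loads(r_keys, s_keys):
    """Estimate a per-key reduce cost from sampled join keys of R and S.

    Assumed cost model (hypothetical): tuples shipped to the reducer
    (redistribution) plus tuples produced by the join (join product).
    """
    r_freq, s_freq = Counter(r_keys), Counter(s_keys)
    loads = {}
    for key in r_freq.keys() | s_freq.keys():
        n_r, n_s = r_freq[key], s_freq[key]  # a missing key counts as 0
        loads[key] = (n_r + n_s) + (n_r * n_s)
    return loads


def assign_keys(loads, num_reducers):
    """Greedy assignment: heaviest key first, to the least-loaded reducer."""
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending load; break ties by key text for determinism.
    for key, load in sorted(loads.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        total, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment


def reducer_totals(loads, assignment, num_reducers):
    """Sum the estimated load that lands on each reducer."""
    totals = [0] * num_reducers
    for key, reducer in assignment.items():
        totals[reducer] += loads[key]
    return totals


if __name__ == "__main__":
    # Toy samples: key "a" is heavily skewed in both inputs.
    r_sample = ["a"] * 4 + ["b", "c"]
    s_sample = ["a"] * 3 + ["b"] * 2 + ["d"]
    loads = estimate_key_loads(r_sample, s_sample)
    plan = assign_keys(loads, num_reducers=2)
    print(loads, plan, reducer_totals(loads, plan, 2))
```

In this toy input the skewed key alone outweighs all other keys combined, so assigning whole keys cannot even out the reducers. Splitting such a key into smaller fragments is one plausible reading of "fine-grained" partitioning, although the abstract itself does not confirm that detail.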

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Elaheh Gavagsaz & Ali Rezaee
Department of Applied Mathematics, Faculty of Mathematics and Computer Science, Shahed University, Tehran, Iran
Hamid Haj Seyyed Javadi

Corresponding author

Correspondence to Ali Rezaee.

About this article

Cite this article

Gavagsaz, E., Rezaee, A. & Haj Seyyed Javadi, H. Load balancing in join algorithms for skewed data in MapReduce systems. J Supercomput 75, 228–254 (2019). https://doi.org/10.1007/s11227-018-2578-0

Issue Date: 09 January 2019