Abstract
Join is an essential operation for analyzing data collected from different data sources. MapReduce has emerged as a prominent programming model for processing massive data. However, traditional MapReduce-based join algorithms are inefficient on skewed data: skew in the input leads to considerable load imbalance and performance degradation. This paper proposes a new skew-insensitive method, called fine-grained partitioning for skew data (FGSD), which improves load balancing across reduce tasks. The proposed method considers the properties of both input and output data through a proposed stream sampling algorithm. FGSD introduces a new approach to distributing input data that efficiently handles both redistribution skew and join product skew. The experimental results confirm that our solution not only achieves better load balance but also reduces job execution time under varying degrees of data skew. Furthermore, FGSD requires no modification to the MapReduce environment and is applicable to complex joins.
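The abstract does not spell out FGSD's steps, so the following is only a minimal sketch of the general idea it describes: sample the join keys of both inputs from a stream, estimate a per-key reducer cost that accounts for input tuples and join output, and assign keys to reducers so that the estimated load is balanced. The function names, the cost formula, and the greedy heaviest-first assignment are illustrative assumptions, not the authors' algorithm.

```python
import heapq
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def estimate_key_cost(sample_r, sample_s):
    """Estimate a per-key cost from sampled join keys of inputs R and S.

    Assumed cost model: tuples received (input side) plus tuples
    produced by the join (output side), both in sample units.
    """
    freq_r, freq_s = Counter(sample_r), Counter(sample_s)
    keys = set(freq_r) | set(freq_s)
    return {k: freq_r[k] + freq_s[k] + freq_r[k] * freq_s[k] for k in keys}


def greedy_assign(key_cost, num_reducers):
    """Assign keys, heaviest first, to the currently least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    for key, cost in sorted(key_cost.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment


def max_load(key_cost, assignment, num_reducers):
    """Return the estimated load of the most heavily loaded reducer."""
    loads = [0] * num_reducers
    for key, cost in key_cost.items():
        loads[assignment[key]] += cost
    return max(loads)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical skewed inputs: key 0 dominates both relations.
    r_keys = [0] * 500 + [rng.randrange(1, 50) for _ in range(500)]
    s_keys = [0] * 300 + [rng.randrange(1, 50) for _ in range(700)]
    cost = estimate_key_cost(reservoir_sample(r_keys, 200, rng),
                             reservoir_sample(s_keys, 200, rng))
    balanced = greedy_assign(cost, 4)
    hashed = {k: k % 4 for k in cost}  # stand-in for default hash partitioning
    print("max load, hash partitioning:  ", max_load(cost, hashed, 4))
    print("max load, greedy assignment:  ", max_load(cost, balanced, 4))
```

Assigning whole keys cannot help when a single key outweighs the average reducer load; a fine-grained scheme would also have to split such a key across several reducers, which this sketch leaves out.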
Cite this article
Gavagsaz, E., Rezaee, A. & Haj Seyyed Javadi, H. Load balancing in join algorithms for skewed data in MapReduce systems. J Supercomput 75, 228–254 (2019). https://doi.org/10.1007/s11227-018-2578-0
DOI: https://doi.org/10.1007/s11227-018-2578-0