Abstract
With the explosion of data in recent years, timely and cost-effective analytics over large-scale data has become a hotspot of data management research. Join is an important operation in database queries. However, data skew occurs naturally in many applications and severely degrades the performance of most join algorithms. To address this problem, this paper introduces an Adaptive Skew Insensitive (ASI) join algorithm to handle serious data skew. Based on our cost analysis, the ASI join algorithm adaptively chooses the best join algorithm for different inputs. In experimental comparisons with several state-of-the-art join methods, our method achieves a significant improvement in join efficiency under data skew.
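To make the skew problem concrete, the following is a minimal Python sketch of how a single frequent join key overloads one partition in a partitioned join. It is not the ASI algorithm; the integer keys, the modulo partitioning scheme, the four partitions, and the function names `partition_loads` and `imbalance` are all assumptions chosen for illustration.

```python
def partition_loads(keys, num_partitions):
    """Count how many tuples each partition receives under modulo partitioning."""
    loads = [0] * num_partitions
    for key in keys:
        loads[key % num_partitions] += 1
    return loads


def imbalance(loads):
    """Ratio of the heaviest partition to the mean load (1.0 means balanced)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical inputs: 1000 tuples each, one uniform and one heavily skewed.
uniform_keys = list(range(1000))
skewed_keys = [7] * 900 + list(range(100))  # key 7 accounts for 90% of tuples

uniform_loads = partition_loads(uniform_keys, 4)
skewed_loads = partition_loads(skewed_keys, 4)

print(uniform_loads, imbalance(uniform_loads))
print(skewed_loads, imbalance(skewed_loads))
```

In the skewed case one partition receives almost all tuples, so the worker that owns it would dominate the running time of a parallel join. This is the kind of imbalance that skew-handling join algorithms aim to avoid.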
This work was supported by the Natural Science Foundation of China (No. 60973002 and No. 61170003), the National High Technology Research and Development Program of China (Grant No. 2012AA011002), and the MOE-CMCC Research Fund.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Liao, W., Wang, T., Li, H., Yang, D., Qiu, Z., Lei, K. (2014). An Adaptive Skew Insensitive Join Algorithm for Large Scale Data Analytics. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_44
DOI: https://doi.org/10.1007/978-3-319-11116-2_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11115-5
Online ISBN: 978-3-319-11116-2
eBook Packages: Computer Science (R0)