An Adaptive Skew Insensitive Join Algorithm for Large Scale Data Analytics

Liao, Wenjing; Wang, Tengjiao; Li, Hongyan; Yang, Dongqing; Qiu, Zhen; Lei, Kai

doi:10.1007/978-3-319-11116-2_44

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8709)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

With the data explosion of recent years, timely and cost-effective analytics over large-scale data has become a hot topic in data management research. Join is an important operation in database queries. However, data skew arises naturally in many applications and severely degrades the performance of most join algorithms. To address this problem, this paper introduces an Adaptive Skew Insensitive (ASI) join algorithm that handles severe data skew. Based on our cost analysis, the ASI join algorithm adaptively chooses the best join algorithm for different inputs. In experiments against several state-of-the-art join methods, our method achieves a significant improvement in join efficiency under data skew.
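To make the skew problem concrete, the following minimal sketch shows how a single hot join key unbalances a hash-partitioned join. It is an illustration only and not the ASI algorithm from the paper: the helper names `partition_loads` and `skew_ratio`, the four-partition setup, and the synthetic key distributions are all assumptions chosen for the example.

```python
def partition_loads(keys, num_partitions):
    """Count how many tuples each partition receives under hash partitioning."""
    loads = [0] * num_partitions
    for key in keys:
        # Integer keys: modulo stands in for the hash function (an assumption).
        loads[key % num_partitions] += 1
    return loads


def skew_ratio(loads):
    """Heaviest partition divided by the mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


# Uniform input: 1000 tuples spread evenly over keys 0..99.
uniform_keys = [i % 100 for i in range(1000)]

# Skewed input: half of the tuples share the hot key 7, the rest are uniform.
skewed_keys = [7] * 500 + [i % 100 for i in range(500)]

uniform = partition_loads(uniform_keys, 4)
skewed = partition_loads(skewed_keys, 4)

print(uniform, skew_ratio(uniform))
print(skewed, skew_ratio(skewed))
```

In the skewed case, the partition that owns the hot key receives several times the load of the others, so the worker handling it finishes last and determines the running time of the whole join.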

This work was supported by the Natural Science Foundation of China (No. 60973002 and No. 61170003), the National High Technology Research and Development Program of China (Grant No. 2012AA011002), and the MOE-CMCC Research Fund.

Author information

Authors and Affiliations

School of Electronics and Computer Engineering (ECE), Peking University, Shenzhen, 518055, China
Wenjing Liao, Tengjiao Wang & Kai Lei
School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Tengjiao Wang, Hongyan Li, Dongqing Yang & Zhen Qiu
Key Laboratory of Machine Perception, Peking University, Ministry of Education, Beijing, 100871, China
Hongyan Li & Zhen Qiu
Key Laboratory of High Confidence Software Technologies, Peking University, Ministry of Education, Beijing, 100871, China
Wenjing Liao & Tengjiao Wang

Editor information

Editors and Affiliations

Beijing Institute of Spacecraft System Engineering, Beijing, China
Lei Chen
School of Computer Science, National University of Defense Technology, 410073, Changsha, Hunan, China
Yan Jia
RMIT University, Melbourne, Australia
Timos Sellis
School of Computer Science and Technology, Soochow University, 215006, Suzhou, China
Guanfeng Liu

About this paper

Cite this paper

Liao, W., Wang, T., Li, H., Yang, D., Qiu, Z., Lei, K. (2014). An Adaptive Skew Insensitive Join Algorithm for Large Scale Data Analytics. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_44

DOI: https://doi.org/10.1007/978-3-319-11116-2_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11115-5
Online ISBN: 978-3-319-11116-2
eBook Packages: Computer Science, Computer Science (R0)
