Efficient Query Algorithm of Coallocation -Parallel-Hash-Join in the Cloud Data Center

Shen, Yao; Lu, Ping; Qin, Xiaolin; Qian, Yuming; Wang, Sheng

doi:10.1007/978-3-319-27051-7_26

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9483)

Included in the following conference series:

International Conference on Cloud Computing and Security

Abstract

In the hybrid architecture of a cloud data center, data partitioning is an important factor affecting query performance. Costly join operations executed through hybrid MapReduce require large-scale data transfer across nodes, so the overhead of network transmission and I/O is huge. To reduce data traffic and improve the efficiency of join queries, this paper proposes an efficient Coallocation Parallel Hash Join (CPHJ) algorithm. First, CPHJ designs a consistent multi-redundant hashing algorithm that distributes tables with a join relationship across the cluster according to their join attributes, which improves data locality during join query processing while also ensuring data availability. Then, on the basis of the consistent multi-redundant hashing algorithm, a parallel join query algorithm called ParallelHashJoin is proposed that effectively improves the efficiency of join queries. CPHJ has been applied in the data warehouse system of Alibaba, and experimental results indicate that its query efficiency is nearly five times that of the Hive system.
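The abstract does not give the details of CPHJ, so the following is only a minimal illustrative sketch of the two general ideas it names, not the authors' algorithm. It assumes a textbook consistent-hash ring with virtual nodes, a replication factor `replicas`, and a build/probe hash join run per node in a thread pool. All names (`ReplicatedHashRing`, `place`, `local_hash_join`, `co_located_join`) are hypothetical and do not come from the paper.

```python
import bisect
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def stable_hash(value):
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ReplicatedHashRing:
    """Consistent-hash ring mapping a join key to `replicas` distinct nodes."""

    def __init__(self, nodes, replicas=2, vnodes=64):
        if replicas > len(nodes):
            raise ValueError("replicas cannot exceed the number of nodes")
        self.replicas = replicas
        # Each physical node owns several virtual points to even out the load.
        self.ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def owners(self, key):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect_right(self.points, stable_hash(key))
        found = []
        for step in range(len(self.ring)):
            node = self.ring[(start + step) % len(self.ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == self.replicas:
                    break
        return found


def place(ring, rows, key_index):
    """Store every row on all owners of its join key (node -> list of rows)."""
    placement = defaultdict(list)
    for row in rows:
        for node in ring.owners(row[key_index]):
            placement[node].append(row)
    return placement


def local_hash_join(node, ring, left_rows, right_rows, left_key, right_key):
    """Build/probe hash join restricted to keys whose primary owner is `node`.

    Restricting the build side to primary-owned keys keeps replicas from
    emitting the same joined row twice.
    """
    table = defaultdict(list)
    for row in left_rows:
        if ring.owners(row[left_key])[0] == node:
            table[row[left_key]].append(row)
    joined = []
    for row in right_rows:
        for match in table.get(row[right_key], []):
            joined.append(match + row)
    return joined


def co_located_join(ring, nodes, left, right, left_key=0, right_key=0):
    """Place both tables by join key, then join each node's data in parallel."""
    left_parts = place(ring, left, left_key)
    right_parts = place(ring, right, right_key)
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(local_hash_join, node, ring,
                        left_parts[node], right_parts[node], left_key, right_key)
            for node in nodes
        ]
        results = []
        for future in futures:
            results.extend(future.result())
    return results
```

Because both tables are placed with the same ring and the same key, matching rows always land on the same nodes, so each local join needs no data from other nodes. In this sketch the extra replicas only provide redundancy; rerouting a join to a secondary owner when a node fails is left out.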

Acknowledgments

This project was supported by the National Natural Science Foundation of China (Nos. 61373015, 61300052), the National High Technology Research and Development Program of China (863 Program) (No. 2007AA01Z404), the Research Fund for the Doctoral Program of Higher Education of China (No. 20103218110017), a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), the Fundamental Research Funds for the Central Universities, NUAA (No. NP2013307), and the Funding of Jiangsu Innovation Program for Graduate Education (KYLX_0287), the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China
Yao Shen, Xiaolin Qin & Sheng Wang
ZTE Corporation, Nanjing Research and Development Center, Nanjing, 210012, China
Ping Lu & Yuming Qian

Corresponding author

Correspondence to Yao Shen.

Editor information

Editors and Affiliations

Nanjing University of Aeronautics and As, Nanjing, China
Zhiqiu Huang
Nanjing University of Information Scienc, Nanjing, China
Xingming Sun
Nanjing, China
Junzhou Luo
Nanjing University of Aeronautics and As, Nanjing, China
Jian Wang

About this paper

Cite this paper

Shen, Y., Lu, P., Qin, X., Qian, Y., Wang, S. (2015). Efficient Query Algorithm of Coallocation -Parallel-Hash-Join in the Cloud Data Center. In: Huang, Z., Sun, X., Luo, J., Wang, J. (eds) Cloud Computing and Security. ICCCS 2015. Lecture Notes in Computer Science, vol 9483. Springer, Cham. https://doi.org/10.1007/978-3-319-27051-7_26

DOI: https://doi.org/10.1007/978-3-319-27051-7_26
Published: 05 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27050-0
Online ISBN: 978-3-319-27051-7
eBook Packages: Computer Science, Computer Science (R0)
