Abstract
In the hybrid architecture of a cloud data center, data partitioning is an important factor affecting query performance. Costly join operations executed with hybrid MapReduce incur heavy network-transmission and I/O overhead, because large volumes of data must be transferred across nodes. To reduce data traffic and improve the efficiency of join queries, this paper proposes an efficient algorithm called Coallocation Parallel Hash Join (CPHJ). First, CPHJ designs a consistent multi-redundant hashing algorithm that distributes tables with a join relationship across the cluster according to their join attributes, which improves data locality during join query processing while also ensuring data availability. Then, on the basis of the consistent multi-redundant hashing algorithm, a parallel join query algorithm called ParallelHashJoin is proposed that effectively improves the efficiency of join queries. The CPHJ method has been applied in the data warehouse system of Alibaba, and experimental results indicate that the query efficiency of CPHJ is nearly five times that of the Hive system.
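The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea, not the authors' CPHJ implementation. It assumes a plain consistent-hash ring with virtual nodes, places each row on several distinct nodes chosen by its join-key value, and then runs an ordinary build/probe hash join per node on the primary copies. All names (`ReplicatedHashRing`, `nodes_for`, `local_hash_join`, `colocated_join`) and parameters (`replicas`, `vnodes`) are hypothetical, and the per-node loop runs sequentially here where a real system would run it in parallel.

```python
import bisect
import hashlib
from collections import defaultdict


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ReplicatedHashRing:
    """Consistent-hash ring mapping a join-key value to `replicas` distinct nodes."""

    def __init__(self, nodes, replicas=2, vnodes=64):
        if replicas > len(nodes):
            raise ValueError("replicas cannot exceed the number of nodes")
        self.replicas = replicas
        # Each physical node appears at `vnodes` ring positions to even out load.
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def nodes_for(self, join_key):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect_right(self._points, _hash(str(join_key)))
        owners = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == self.replicas:
                    break
        return owners  # owners[0] is the primary, the rest hold redundant copies


def local_hash_join(left_rows, right_rows, left_key, right_key):
    """Classic build/probe hash join over rows already stored on one node."""
    table = defaultdict(list)
    for row in left_rows:  # build phase
        table[row[left_key]].append(row)
    # Probe phase: emit one concatenated tuple per matching pair.
    return [l + r for r in right_rows for l in table.get(r[right_key], [])]


def colocated_join(ring, left, right, left_key, right_key):
    """Join two tables partitioned by the same ring; no cross-node shuffle needed."""
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for row in left:
        left_parts[ring.nodes_for(row[left_key])[0]].append(row)
    for row in right:
        right_parts[ring.nodes_for(row[right_key])[0]].append(row)
    result = []
    for node in left_parts:  # each iteration stands in for one node's local work
        result.extend(local_hash_join(left_parts[node], right_parts[node], left_key, right_key))
    return result


ring = ReplicatedHashRing(["node-a", "node-b", "node-c"], replicas=2)
users = [(1, "alice"), (2, "bob"), (4, "dave")]
orders = [(1, "o-a"), (2, "o-b"), (1, "o-c"), (3, "o-d")]
joined = colocated_join(ring, users, orders, left_key=0, right_key=0)
```

Because both tables are placed by hashing the same join-key value on the same ring, matching rows always share a primary node, so the union of the node-local joins equals the full join. The extra nodes returned by `nodes_for` illustrate where redundant copies would live if the primary became unavailable.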
Acknowledgments
This project was supported by the National Natural Science Foundation of China (Nos. 61373015, 61300052), the National High Technology Research and Development Program of China (863 Program) (No. 2007AA01Z404), the Research Fund for the Doctoral Program of Higher Education of China (No. 20103218110017), a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), the Fundamental Research Funds for the Central Universities, NUAA (No. NP2013307), the Funding of Jiangsu Innovation Program for Graduate Education (No. KYLX_0287), and the Fundamental Research Funds for the Central Universities.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Shen, Y., Lu, P., Qin, X., Qian, Y., Wang, S. (2015). Efficient Query Algorithm of Coallocation-Parallel-Hash-Join in the Cloud Data Center. In: Huang, Z., Sun, X., Luo, J., Wang, J. (eds) Cloud Computing and Security. ICCCS 2015. Lecture Notes in Computer Science, vol 9483. Springer, Cham. https://doi.org/10.1007/978-3-319-27051-7_26
DOI: https://doi.org/10.1007/978-3-319-27051-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27050-0
Online ISBN: 978-3-319-27051-7
eBook Packages: Computer Science, Computer Science (R0)