
Efficient Query Algorithm of Coallocation -Parallel-Hash-Join in the Cloud Data Center

  • Conference paper
Cloud Computing and Security (ICCCS 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9483)

Abstract

In the hybrid architecture of a cloud data center, data partitioning is an important factor affecting query performance. Costly join operations that use hybrid MapReduce incur heavy network-transmission and I/O overhead, because they require large-scale data transfer across nodes. To reduce data traffic and improve the efficiency of join queries, this paper proposes an efficient algorithm called Coallocation Parallel Hash Join (CPHJ). First, CPHJ designs a consistent multi-redundant hashing algorithm that distributes tables with a join relationship across the cluster according to their join attributes, which not only improves data locality during join query processing but also ensures data availability. Then, on the basis of the consistent multi-redundant hashing algorithm, a parallel join query algorithm called ParallelHashJoin is proposed, which effectively improves the efficiency of join queries. The CPHJ method has been applied in the data warehouse system of Alibaba, and experimental results indicate that the efficiency of CPHJ on the tested queries is nearly five times that of the Hive system.
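The two ideas in the abstract (placing joinable tables by hashing the join attribute with redundancy, then joining locally on each node) can be illustrated with a minimal Python sketch. This is not the paper's CPHJ or ParallelHashJoin implementation. The names `Ring`, `place`, `local_hash_join`, and `coallocated_join`, the replica count, the virtual-node count, and the sample tables are all assumptions made for illustration. The per-node loop runs sequentially here, whereas a real cluster would run it in parallel.

```python
import bisect
import hashlib
from collections import defaultdict


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    """Consistent-hash ring that returns `replicas` distinct nodes per key."""

    def __init__(self, nodes, replicas=2, vnodes=64):
        self.replicas = min(replicas, len(nodes))
        self.points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.points]

    def owners(self, key):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect(self.hashes, _hash(str(key)))
        found = []
        for step in range(len(self.points)):
            node = self.points[(start + step) % len(self.points)][1]
            if node not in found:
                found.append(node)
                if len(found) == self.replicas:
                    break
        return found


def place(ring, rows, key_index):
    """Store each row on every owner of its join key: node -> list of rows."""
    storage = defaultdict(list)
    for row in rows:
        for node in ring.owners(row[key_index]):
            storage[node].append(row)
    return storage


def local_hash_join(left_rows, right_rows, left_key, right_key):
    """Classic build/probe hash join over the rows held by a single node."""
    table = defaultdict(list)
    for row in left_rows:  # build phase
        table[row[left_key]].append(row)
    # probe phase: emit one combined row per matching pair
    return [l + r for r in right_rows for l in table.get(r[right_key], [])]


def coallocated_join(ring, left, right, left_key, right_key):
    """Join each key on its primary owner only, so replicas add no duplicates."""
    left_store = place(ring, left, left_key)
    right_store = place(ring, right, right_key)
    result = []
    for node in set(left_store) | set(right_store):  # parallel in a real cluster
        mine_l = [r for r in left_store[node] if ring.owners(r[left_key])[0] == node]
        mine_r = [r for r in right_store[node] if ring.owners(r[right_key])[0] == node]
        result.extend(local_hash_join(mine_l, mine_r, left_key, right_key))
    return result


# Hypothetical sample data: users(id, name) joined with orders(order_id, user_id).
ring = Ring(["node-a", "node-b", "node-c"], replicas=2)
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(10, 1), (11, 1), (12, 3), (13, 4)]
joined = coallocated_join(ring, users, orders, left_key=0, right_key=1)
```

Because both tables are placed by the same hash of the join key, matching rows always land on the same nodes, so no rows need to move between nodes at join time. The extra replicas exist only for availability in this sketch, and the join reads just the primary copy of each key.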

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61373015, 61300052), the National High Technology Research and Development Program of China (863 Program) (No. 2007AA01Z404), the Research Fund for the Doctoral Program of Higher Education of China (No. 20103218110017), the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), the Fundamental Research Funds for the Central Universities, NUAA (No. NP2013307), and the Funding of Jiangsu Innovation Program for Graduate Education (KYLX_0287).

Author information

Corresponding author

Correspondence to Yao Shen.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Shen, Y., Lu, P., Qin, X., Qian, Y., Wang, S. (2015). Efficient Query Algorithm of Coallocation -Parallel-Hash-Join in the Cloud Data Center. In: Huang, Z., Sun, X., Luo, J., Wang, J. (eds) Cloud Computing and Security. ICCCS 2015. Lecture Notes in Computer Science, vol 9483. Springer, Cham. https://doi.org/10.1007/978-3-319-27051-7_26

  • DOI: https://doi.org/10.1007/978-3-319-27051-7_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27050-0

  • Online ISBN: 978-3-319-27051-7

  • eBook Packages: Computer Science, Computer Science (R0)
