Cost-Based Join Algorithm Selection in Hadoop

  • Conference paper
Web Information Systems Engineering – WISE 2014 (WISE 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8787)

Abstract

In recent years, MapReduce has become a popular computing framework for big data analysis. Join is a major query type in data analysis, and various algorithms have been designed to process join queries on top of Hadoop. Because the efficiency of these algorithms varies with the join task at hand, users must select an appropriate algorithm and configure it properly to achieve good performance, which is difficult for many end users. This paper proposes a cost model that estimates the cost of four popular join algorithms. Based on the cost model, the system can automatically choose the join algorithm with the least cost and then suggest reasonable configuration values for the chosen algorithm. Experimental results on the TPC-H benchmark verify that the proposed method correctly chooses the best join algorithm, and that the chosen algorithm achieves a speedup of around 1.25 times over the default join algorithm.
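
The selection step described in the abstract, estimating a cost for each candidate algorithm and picking the cheapest, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the four algorithms or give their cost formulas, so the two candidates ("repartition" and "broadcast"), the `JoinTask` fields, and the toy cost functions below are hypothetical placeholders, not the paper's cost model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class JoinTask:
    """Hypothetical description of a two-table join (sizes in MB)."""
    small_mb: float
    large_mb: float
    memory_mb: float  # memory assumed available to one map task


# Toy cost estimators. These are placeholders, NOT the paper's cost model.
# Each returns an abstract cost; infinity means "not applicable".
def repartition_cost(t: JoinTask) -> float:
    # Assumed: both inputs are read, shuffled, and read again by reducers.
    return 3.0 * (t.small_mb + t.large_mb)


def broadcast_cost(t: JoinTask) -> float:
    # Assumed: only usable when the small table fits in a map task's memory.
    if t.small_mb > t.memory_mb:
        return float("inf")
    return t.large_mb + 10.0 * t.small_mb


ESTIMATORS: Dict[str, Callable[[JoinTask], float]] = {
    "repartition": repartition_cost,
    "broadcast": broadcast_cost,
}


def choose_join(task: JoinTask) -> Tuple[str, float]:
    """Return the (name, cost) of the candidate with the least estimated cost."""
    costs = {name: estimate(task) for name, estimate in ESTIMATORS.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]
```

With these toy formulas, `choose_join(JoinTask(small_mb=50, large_mb=10000, memory_mb=512))` selects the hypothetical broadcast candidate, while a small table larger than `memory_mb` falls back to repartition. A real cost model would replace the two functions with estimators calibrated to the cluster and the data.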

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Gu, J., Peng, S., Wang, X.S., Rao, W., Yang, M., Cao, Y. (2014). Cost-Based Join Algorithm Selection in Hadoop. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8787. Springer, Cham. https://doi.org/10.1007/978-3-319-11746-1_18

  • DOI: https://doi.org/10.1007/978-3-319-11746-1_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11745-4

  • Online ISBN: 978-3-319-11746-1

  • eBook Packages: Computer Science, Computer Science (R0)