An Efficient Two-Table Join Query Processing Based on Extended Bloom Filter in MapReduce

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9998)

Abstract

With the development of Cloud Computing, the Internet of Things, and similar technologies, a large amount of data is being produced. MapReduce, a processing architecture for Cloud Computing, is widely used for large-scale data processing. However, when two tables are joined under the MapReduce data processing model, a great deal of data that does not satisfy the join condition is still transferred from the map side to the reduce side. This adds time overhead and I/O cost in the shuffle stage and lowers efficiency. Improving join query processing on MapReduce has therefore become an urgent problem. In this paper, we propose a two-table join query processing method and optimization strategies that address these problems. The optimized method extends the Bloom Filter, reduces the time spent in the shuffle phase, and improves the efficiency of the system.
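
The filtering idea described in the abstract can be illustrated with a minimal, single-process sketch. It assumes a standard Bloom filter, not the extended variant proposed in the paper, and every name in it (`BloomFilter`, `map_phase`, `reduce_phase`, the `"S"`/`"L"` tags, the filter sizes) is hypothetical and chosen only for illustration. The map side builds a filter over the join keys of the smaller table and drops records of the larger table whose key cannot match, so fewer records reach the simulated shuffle.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter (assumption: NOT the paper's extended variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def map_phase(small, large):
    """Tag records by source table; filter the large table by join key."""
    bloom = BloomFilter()
    for key, _ in small:
        bloom.add(key)
    shuffled = [(key, ("S", value)) for key, value in small]
    # Records whose key is definitely not in the small table never shuffle.
    shuffled += [(key, ("L", value)) for key, value in large
                 if bloom.might_contain(key)]
    return shuffled


def reduce_phase(shuffled):
    """Group by join key and emit the per-key cross product."""
    groups = {}
    for key, tagged in shuffled:
        groups.setdefault(key, []).append(tagged)
    joined = []
    for key, tagged_values in groups.items():
        left = [v for tag, v in tagged_values if tag == "S"]
        right = [v for tag, v in tagged_values if tag == "L"]
        joined.extend((key, l, r) for l in left for r in right)
    return joined


if __name__ == "__main__":
    small = [(1, "a"), (2, "b")]
    large = [(1, "x"), (3, "y"), (2, "z"), (4, "w")]
    print(sorted(reduce_phase(map_phase(small, large))))
```

A false positive in the filter only lets an extra record through the shuffle; the reduce side still pairs records by exact key, so the join result is unaffected.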

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61472169 and 61502215; the Science Research Normal Fund of Liaoning Province Education Department (L2015193); the Doctoral Scientific Research Start Foundation of Liaoning Province (201501127); and the Young Research Foundation of Liaoning University under Grant No. LDQN201438.

Author information

Corresponding author

Correspondence to Linlin Ding.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Wang, J., Pang, J., Li, X., Han, B., Huang, L., Ding, L. (2016). An Efficient Two-Table Join Query Processing Based on Extended Bloom Filter in MapReduce. In: Song, S., Tong, Y. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9998. Springer, Cham. https://doi.org/10.1007/978-3-319-47121-1_21

  • DOI: https://doi.org/10.1007/978-3-319-47121-1_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47120-4

  • Online ISBN: 978-3-319-47121-1

  • eBook Packages: Computer Science, Computer Science (R0)
