Abstract
With the development of cloud computing, the Internet of Things, and similar technologies, a large amount of data is being produced. MapReduce, a processing architecture for cloud computing, is widely used for large-scale data processing. However, when two tables are joined under the MapReduce data processing model, a great deal of data does not satisfy the join condition, yet it is still transferred from the map side to the reduce side. This adds time overhead and I/O cost in the shuffle stage and lowers efficiency. Improving join query processing on MapReduce has therefore become an urgent problem. In this paper, we put forward two-table join query processing and optimization strategies to address these problems. The optimized method extends the Bloom filter, reduces the time spent in the shuffle phase, and improves the efficiency of the system.
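The filtering idea in the abstract can be illustrated with a minimal, single-process sketch. It uses a plain Bloom filter, not the extended variant the paper proposes, and the names `BloomFilter`, `filtered_join`, `num_bits`, and `num_hashes` are hypothetical choices made for this illustration. A filter is built on the join keys of the smaller table, tuples of the larger table whose keys cannot match are dropped before the simulated shuffle, and an exact join afterwards removes any false positives.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def filtered_join(small, large, num_bits=1024, num_hashes=3):
    """Join two lists of (key, value) pairs, pruning `large` first.

    Returns the joined rows and the number of `large` tuples that
    would still have to be shuffled (assumption: one process stands
    in for the map and reduce sides).
    """
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in small:
        bloom.add(key)

    # "Map side": drop tuples whose key cannot appear in `small`.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]

    # "Reduce side": an exact hash join discards false positives.
    index = {}
    for k, v in small:
        index.setdefault(k, []).append(v)
    joined = [(k, sv, lv) for k, lv in survivors for sv in index.get(k, [])]
    return joined, len(survivors)


if __name__ == "__main__":
    small = [(1, "a"), (2, "b")]
    large = [(i, f"r{i}") for i in range(100)]
    rows, shuffled = filtered_join(small, large)
    print(rows, f"shuffled {shuffled} of {len(large)} tuples")
```

Because a Bloom filter never produces false negatives, every matching tuple survives the pruning step. The filter size and number of hash functions only affect how many non-matching tuples slip through to the shuffle.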
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant Nos. 61472169 and 61502215; the Science Research Normal Fund of Liaoning Province Education Department (L2015193); the Doctoral Scientific Research Start Foundation of Liaoning Province (201501127); and the Young Research Foundation of Liaoning University under Grant No. LDQN201438.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Wang, J., Pang, J., Li, X., Han, B., Huang, L., Ding, L. (2016). An Efficient Two-Table Join Query Processing Based on Extended Bloom Filter in MapReduce. In: Song, S., Tong, Y. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science, vol 9998. Springer, Cham. https://doi.org/10.1007/978-3-319-47121-1_21
DOI: https://doi.org/10.1007/978-3-319-47121-1_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47120-4
Online ISBN: 978-3-319-47121-1
eBook Packages: Computer Science, Computer Science (R0)