
Set similarity join on massive probabilistic data using MapReduce

Distributed and Parallel Databases

Abstract

In this paper, we focus on set similarity join on massive probabilistic data using MapReduce, a problem for which no effective approach currently exists. MapReduce is a popular paradigm for processing large volumes of data efficiently, and we propose two MapReduce-based approaches to this task: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of the existence probabilities to filter out, directly at the Map task side, the probabilistic sets that have no chance of being similar to any other probabilistic set. Hadoop Join by Reduce Side Pruning uses a probability-sum-based pruning principle and a probability-upper-bound-based pruning principle to reduce the candidate pairs at the Reduce task side, which saves comparison cost. Building on these approaches, we propose a hybrid solution that employs both Map-side and Reduce-side pruning. Finally, we implemented the approaches on Hadoop-0.20.2 and performed comprehensive experiments to evaluate their performance; we also measured the speedup ratio relative to the naive method, Block Nested Loop Join. The experimental results show that our approaches perform much better than Block Nested Loop Join and scale well. To the best of our knowledge, this is the first work to address set similarity join on massive probabilistic data using the MapReduce paradigm, and the proposed approaches provide a new way to process massive probabilistic data.
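
The two pruning stages can be illustrated with a minimal single-machine sketch in Python. It is not the paper's algorithm: the abstract does not state the pruning formulas, so the bounds below are simple stand-ins chosen for illustration. The sketch assumes element-level uncertainty (each element carries its own existence probability), independence between different sets, a similarity threshold greater than zero, and a join predicate of the form Pr[sim(R, S) >= gamma] >= alpha. The names `map_side_keep`, `pair_upper_bound`, and `candidate_pairs` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical representation (an assumption of this sketch): a probabilistic
# set maps each element to its existence probability.
ProbSet = Dict[str, float]


def map_side_keep(pset: ProbSet, alpha: float) -> bool:
    """Map-side filter based on the sum of existence probabilities.

    With a similarity threshold above zero, a set can only match when at
    least one of its elements exists. By Markov's inequality the probability
    of that event is at most the sum of the existence probabilities, so a
    set whose sum is below alpha cannot join with any other set.
    """
    return sum(pset.values()) >= alpha


def pair_upper_bound(r: ProbSet, s: ProbSet) -> float:
    """Upper bound on Pr[sim(r, s) >= gamma] for gamma > 0.

    A similar pair needs a non-empty intersection. Assuming r and s are
    independent, the expected intersection size is the sum of p_r * p_s over
    shared elements, which bounds the probability of a non-empty overlap.
    """
    shared = r.keys() & s.keys()
    return min(1.0, sum(r[e] * s[e] for e in shared))


def candidate_pairs(sets: Dict[str, ProbSet], alpha: float) -> List[Tuple[str, str]]:
    """Apply the map-side filter, then the pair-level bound (reduce side)."""
    survivors = {k: v for k, v in sets.items() if map_side_keep(v, alpha)}
    return [
        (a, b)
        for a, b in combinations(sorted(survivors), 2)
        if pair_upper_bound(survivors[a], survivors[b]) >= alpha
    ]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    data = {
        "A": {"x": 0.9, "y": 0.8},
        "B": {"x": 0.7, "z": 0.6},
        "C": {"w": 0.1, "x": 0.2},
    }
    print(candidate_pairs(data, alpha=0.5))
```

Pairs that survive both filters would still need an exact probability computation; the filters only discard sets and pairs that provably cannot satisfy the predicate under the stated assumptions. In an actual MapReduce job the first filter would run inside the Map function and the second inside the Reduce function.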


Notes

  1. http://www.cs.brown.edu/~hkimura/upi_dataset.html.


Acknowledgements

This research was partially supported by grants from the Natural Science Foundation of China (No. 61070055, 91024032, 91124001), the National 863 High-tech Program (No. 2012AA011001, 2013AA013204), the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University (No. 11XNL010).

Author information

Corresponding author

Correspondence to Youzhong Ma.

Additional information

Communicated by Feifei Li and Suman Nath.

About this article

Cite this article

Ma, Y., Meng, X. Set similarity join on massive probabilistic data using MapReduce. Distrib Parallel Databases 32, 447–464 (2014). https://doi.org/10.1007/s10619-013-7137-3

  • DOI: https://doi.org/10.1007/s10619-013-7137-3
