Set similarity join on massive probabilistic data using MapReduce

Ma, Youzhong; Meng, Xiaofeng

doi:10.1007/s10619-013-7137-3

Published: 03 December 2013

Volume 32, pages 447–464, (2014)

Distributed and Parallel Databases

Abstract

This paper addresses set similarity join on massive probabilistic data, a problem for which no efficient approach currently exists. MapReduce is a popular paradigm for processing large volumes of data efficiently, and we propose two MapReduce-based approaches to this task: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of the existence probabilities to filter out, directly at the Map task side, the probabilistic sets that have no chance of being similar to any other probabilistic set. Hadoop Join by Reduce Side Pruning uses a probability-sum-based pruning principle and a probability-upper-bound-based pruning principle to reduce the number of candidate pairs at the Reduce task side, which saves comparison cost. Building on these approaches, we propose a hybrid solution that employs both Map-side and Reduce-side pruning. Finally, we implemented the approaches on Hadoop-0.20.2 and performed comprehensive experiments to evaluate their performance; we also measured the speedup ratio relative to the naive method, Block Nested Loop Join. The experimental results show that our approaches perform much better than Block Nested Loop Join and scale well. To the best of our knowledge, this is the first work to address set similarity join on massive probabilistic data using the MapReduce paradigm, and the proposed approaches provide a new way to process massive probabilistic data.
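The abstract does not state the exact pruning rules, so the following is only a minimal sketch of how probability-sum pruning could work under assumptions that are not taken from the paper: each probabilistic set is a mapping from element to an independent existence probability, and a pair qualifies only if the probability that its similarity reaches a positive threshold is at least `alpha`. Under those assumptions a pair can be similar only if the two sets share at least one element, and by Markov's inequality the probability of that is at most the expected overlap, which in turn cannot exceed either set's probability sum. All names (`ProbSet`, `map_side_prune`, `reduce_side_candidates`, `alpha`) are hypothetical, and plain Python functions stand in for the Hadoop Map and Reduce tasks.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed model (not from the paper): element -> existence probability,
# with all elements existing independently of one another.
ProbSet = Dict[str, float]


def probability_sum(s: ProbSet) -> float:
    """Sum of existence probabilities, i.e. the expected size of the set."""
    return sum(s.values())


def map_side_prune(
    records: Iterable[Tuple[str, ProbSet]], alpha: float
) -> Iterator[Tuple[str, ProbSet]]:
    """Stand-in for the Map task: drop sets that cannot join with anything.

    Pr[set is non-empty] <= probability_sum(set) by Markov's inequality,
    so a set whose probability sum is below alpha cannot reach alpha.
    """
    for set_id, s in records:
        if probability_sum(s) >= alpha:
            yield set_id, s


def expected_overlap(r: ProbSet, s: ProbSet) -> float:
    """Expected number of shared elements, assuming r and s are independent."""
    if len(r) > len(s):
        r, s = s, r  # iterate over the smaller set
    return sum(p * s[e] for e, p in r.items() if e in s)


def reduce_side_candidates(
    survivors: List[Tuple[str, ProbSet]], alpha: float
) -> List[Tuple[str, str]]:
    """Stand-in for the Reduce task: keep only pairs that pass the bound.

    Pr[overlap >= 1] <= expected_overlap, so pairs below alpha are pruned
    before any expensive similarity-probability computation.
    """
    candidates = []
    for i, (id_r, r) in enumerate(survivors):
        for id_s, s in survivors[i + 1:]:
            if expected_overlap(r, s) >= alpha:
                candidates.append((id_r, id_s))
    return candidates


# Hypothetical toy data for illustration only.
records = [
    ("a", {"x": 0.9, "y": 0.8}),
    ("b", {"x": 0.7, "y": 0.6}),
    ("c", {"z": 0.1}),
    ("d", {"z": 0.9, "w": 0.9}),
]
survivors = list(map_side_prune(records, alpha=0.5))
candidates = reduce_side_candidates(survivors, alpha=0.5)
```

The surviving candidate pairs would still need an exact similarity-probability check; the sketch only shows where cheap bounds can discard work early, which is the role the abstract assigns to the two pruning stages.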

Notes

http://www.cs.brown.edu/~hkimura/upi_dataset.html.

Acknowledgements

This research was partially supported by grants from the Natural Science Foundation of China (No. 61070055, 91024032, 91124001); the National 863 High-tech Program (No. 2012AA011001, 2013AA013204); the Fundamental Research Funds for the Central Universities; and the Research Funds of Renmin University (No. 11XNL010).

Author information

Authors and Affiliations

School of Information, Renmin University of China, Beijing, China
Youzhong Ma & Xiaofeng Meng

Corresponding author

Correspondence to Youzhong Ma.

Additional information

Communicated by Feifei Li and Suman Nath.

About this article

Cite this article

Ma, Y., Meng, X. Set similarity join on massive probabilistic data using MapReduce. Distrib Parallel Databases 32, 447–464 (2014). https://doi.org/10.1007/s10619-013-7137-3

Issue Date: September 2014
DOI: https://doi.org/10.1007/s10619-013-7137-3
