An Experimental Survey of MapReduce-Based Similarity Joins

Silva, Yasin N.; Reed, Jason; Brown, Kyle; Wadsworth, Adelbert; Rong, Chuitian

doi:10.1007/978-3-319-46759-7_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9939)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on Big Data systems. However, many of these techniques were not experimentally compared against alternative approaches, in part because some were developed in parallel while others were not implemented, even as part of their original publications. Consequently, there is no clear understanding of how these techniques compare to each other or which technique to use in specific scenarios. This paper addresses this problem by focusing on the study, classification, and comparison of previously proposed MapReduce-based SJ algorithms. The contributions of this paper include the classification of SJs based on the supported data types and distance functions, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend these algorithms.
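To make the Similarity Join definition above concrete, the following is a minimal single-machine sketch, not one of the surveyed MapReduce algorithms. It assumes points in a vector space, Euclidean distance, and a distance threshold; the names `similarity_join` and `eps` are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance, available in Python 3.8+


def similarity_join(r, s, eps):
    """Return every pair (a, b), a in r and b in s, whose distance is <= eps.

    Illustrative nested-loop join: it compares all |r| * |s| pairs, which is
    the cost that distributed SJ algorithms try to avoid.
    """
    return [(a, b) for a, b in product(r, s) if dist(a, b) <= eps]


# Two small example datasets (made up for this sketch).
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (4.0, 5.0), (9.0, 9.0)]

pairs = similarity_join(R, S, eps=1.0)
for a, b in pairs:
    print(a, b)
```

With these inputs, the join links (0.0, 0.0) with (0.0, 1.0) and (5.0, 5.0) with (4.0, 5.0); the point (9.0, 9.0) has no partner within the threshold. Other data types, such as sets or strings, would swap in a different distance function.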

This work was supported by Arizona State University’s SRCA and NCUIRE awards, the NSFC (No. 61402329), and the China Scholarship Council.

Author information

Authors and Affiliations

Arizona State University, Glendale, AZ, USA
Yasin N. Silva, Jason Reed, Kyle Brown, Adelbert Wadsworth & Chuitian Rong

Corresponding author

Correspondence to Yasin N. Silva.

Editor information

Editors and Affiliations

CNRS–IRISA, Rennes, France
Laurent Amsaleg
National Institute of Informatics, Tokyo, Japan
Michael E. Houle
Ludwig-Maximilians-Universität München, München, Germany
Erich Schubert

About this paper

Cite this paper

Silva, Y.N., Reed, J., Brown, K., Wadsworth, A., Rong, C. (2016). An Experimental Survey of MapReduce-Based Similarity Joins. In: Amsaleg, L., Houle, M., Schubert, E. (eds) Similarity Search and Applications. SISAP 2016. Lecture Notes in Computer Science, vol 9939. Springer, Cham. https://doi.org/10.1007/978-3-319-46759-7_14

DOI: https://doi.org/10.1007/978-3-319-46759-7_14
Published: 27 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46758-0
Online ISBN: 978-3-319-46759-7
eBook Packages: Computer Science, Computer Science (R0)
