
MapReduce-based similarity join for metric spaces

Published: 31 August 2012

Abstract

Cloud-enabled systems have become a crucial component for efficiently processing and analyzing massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has studied Similarity Joins for cloud systems. This paper focuses on the study, design, and implementation techniques of cloud-based Similarity Joins. We present MRSimJoin, a MapReduce-based algorithm that efficiently solves the Similarity Join problem. The algorithm partitions and distributes the data until the subsets are small enough to be processed in a single node. MRSimJoin is general enough to be used with data that lies in any metric space, so it can be used with multiple data types and distance functions. We present guidelines to implement the algorithm in Hadoop, an open-source cloud system. The experimental evaluation of MRSimJoin shows that it has very good execution time and scalability properties.
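
To make the operation concrete, the following Python sketch shows a Similarity Join over an arbitrary metric, together with one round of pivot-based partitioning. It is an illustration only, not the MRSimJoin algorithm or its Hadoop implementation: it runs on a single machine, performs a single partitioning round, and takes the pivots from the caller. The function names `naive_join` and `partitioned_join` and the boundary test are assumptions made for this example; the boundary test relies only on the triangle inequality, so any metric distance function can be passed in.

```python
import itertools
import math


def naive_join(records, dist, eps):
    """Return all index pairs (i, j), i < j, whose distance is below eps."""
    return {
        (i, j)
        for i, j in itertools.combinations(range(len(records)), 2)
        if dist(records[i], records[j]) < eps
    }


def partitioned_join(records, dist, eps, pivots):
    """Same result as naive_join, computed partition by partition (hypothetical sketch)."""
    base = {k: [] for k in range(len(pivots))}  # records grouped by nearest pivot
    window = {}                                 # (a, b) -> records near the a/b boundary
    for i, r in enumerate(records):
        d = [dist(r, p) for p in pivots]
        home = min(range(len(pivots)), key=d.__getitem__)
        base[home].append(i)
        for other in range(len(pivots)):
            # Triangle inequality: r can only match a record whose nearest
            # pivot is `other` if r lies within eps of the boundary.
            if other != home and (d[other] - d[home]) / 2 <= eps:
                key = (min(home, other), max(home, other))
                window.setdefault(key, []).append((i, home))
    result = set()
    for ids in base.values():                   # pairs inside one partition
        for i, j in itertools.combinations(ids, 2):
            if dist(records[i], records[j]) < eps:
                result.add((i, j))
    for members in window.values():             # pairs that straddle a boundary
        for (i, hi), (j, hj) in itertools.combinations(members, 2):
            if hi != hj and dist(records[i], records[j]) < eps:
                result.add((min(i, j), max(i, j)))
    return result


points = [(0, 0), (1, 0), (5, 5), (5, 6)]
pairs = partitioned_join(points, math.dist, 1.5, pivots=[(0, 0), (5, 5)])
print(sorted(pairs))  # [(0, 1), (2, 3)]
```

Each group in `base` or `window` can be joined independently of the others, which is the property a distributed version would exploit; a group that is still too large for one node could be partitioned again with new pivots.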



Published In

Cloud-I '12: Proceedings of the 1st International Workshop on Cloud Intelligence
August 2012
59 pages
ISBN: 9781450315968
DOI: 10.1145/2347673
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Hadoop
  2. MapReduce
  3. metric space
  4. similarity join

Qualifiers

  • Research-article

Conference

Cloud-I '12

Acceptance Rates

Cloud-I '12 Paper Acceptance Rate 8 of 15 submissions, 53%;
Overall Acceptance Rate 12 of 23 submissions, 52%

Cited By

  • (2023) Scalable Computation of Fuzzy Joins Over Large Collections of JSON Data. 2023 IEEE International Conference on Fuzzy Systems (FUZZ), pp. 01-06. DOI: 10.1109/FUZZ52849.2023.10309759. Online publication date: 13-Aug-2023
  • (2023) Diversity Similarity Join for Big Data. Similarity Search and Applications, pp. 238-252. DOI: 10.1007/978-3-031-46994-7_20. Online publication date: 27-Oct-2023
  • (2023) Adding Result Diversification to kNN-Based Joins in a Map-Reduce Framework. Database and Expert Systems Applications, pp. 68-83. DOI: 10.1007/978-3-031-39847-6_5. Online publication date: 18-Aug-2023
  • (2022) Pivot-based approximate k-NN similarity joins for big high-dimensional data. Information Systems 87:C. DOI: 10.1016/j.is.2019.06.006. Online publication date: 21-Apr-2022
  • (2020) Optimization for Large-Scale Fuzzy Joins Using Fuzzy Filters in MapReduce. 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1-8. DOI: 10.1109/FUZZ48607.2020.9177610. Online publication date: Jul-2020
  • (2020) Handling data-skewness in character based string similarity join using Hadoop. Applied Computing and Informatics 18:1/2, pp. 22-44. DOI: 10.1016/j.aci.2018.11.001. Online publication date: 4-Aug-2020
  • (2019) Join Algorithms under Apache Spark. Proceedings of the 2019 5th International Conference on Computer and Technology Applications, pp. 56-62. DOI: 10.1145/3323933.3324094. Online publication date: 16-Apr-2019
  • (2019) A Survey on Parallel Join Algorithms Using MapReduce on Hadoop. 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), pp. 381-388. DOI: 10.1109/JEEIT.2019.8717427. Online publication date: Apr-2019
  • (2019) Scalable Similarity Joins of Tokenized Strings. 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1766-1777. DOI: 10.1109/ICDE.2019.00193. Online publication date: Apr-2019
  • (2018) Improving Hamming distance-based fuzzy join in MapReduce using Bloom Filters. 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1-7. DOI: 10.1109/FUZZ-IEEE.2018.8491658. Online publication date: Jul-2018
