demonstration

Exploiting MapReduce-based similarity joins

Published: 20 May 2012

Abstract

Cloud-enabled systems have become crucial for efficiently processing and analyzing massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a pre-defined threshold ε. Although multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has studied Similarity Joins for cloud systems. This paper presents MRSimJoin, a multi-round MapReduce-based algorithm that efficiently solves the Similarity Join problem. MRSimJoin partitions and distributes the data until the subsets are small enough to be processed on a single node. The proposed algorithm is general enough to be used with data that lies in any metric space. We have implemented MRSimJoin in Hadoop, a widely used open-source cloud system. We show how this operation can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. In particular, we show the use of MRSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show how MRSimJoin scales in each scenario when important parameters, e.g., ε, data size, and number of cluster nodes, increase. We demonstrate the execution of MRSimJoin queries using an Amazon Elastic Compute Cloud (EC2) cluster.
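To make the operation in the abstract concrete, the following is a minimal single-machine sketch of a Similarity Join: it returns every pair of records whose distance is strictly smaller than a threshold. It is not MRSimJoin and does no partitioning or distribution; it only illustrates the result that such an algorithm is expected to produce. The names `similarity_join` and `euclidean`, the parameter `eps`, and the sample points are assumptions introduced for illustration.

```python
from itertools import combinations
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.dist(a, b)


def similarity_join(records, distance, eps):
    """Return all unordered pairs (a, b) with distance(a, b) < eps.

    Brute-force O(n^2) comparison; `distance` can be any metric,
    matching the abstract's statement that the data may lie in any
    metric space. This is an illustrative baseline, not MRSimJoin.
    """
    return [(a, b) for a, b in combinations(records, 2)
            if distance(a, b) < eps]


# Hypothetical 2-D feature vectors forming two tight clusters.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0), (3.2, 4.1)]
pairs = similarity_join(points, euclidean, eps=1.0)
for a, b in pairs:
    print(a, b)
```

The quadratic cost of this baseline is the reason a distributed approach would split the data into smaller subsets before comparing records.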



Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN: 9781450312479
DOI: 10.1145/2213836
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. hadoop
  2. mapreduce
  3. similarity join

Qualifiers

  • Demonstration

Conference

SIGMOD/PODS '12
Acceptance Rates

SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 8
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 07 Mar 2025

Cited By

  • (2025) Supporting Efficient Family Joins for Big Data Tables via Multiple Freedom Family Index. IEEE Access 13, pp. 21707-21722. DOI: 10.1109/ACCESS.2025.3535693. Online publication date: 2025
  • (2023) Scalable Computation of Fuzzy Joins Over Large Collections of JSON Data. 2023 IEEE International Conference on Fuzzy Systems (FUZZ), pp. 1-6. DOI: 10.1109/FUZZ52849.2023.10309759. Online publication date: 13-Aug-2023
  • (2023) Diversity Similarity Join for Big Data. Similarity Search and Applications, pp. 238-252. DOI: 10.1007/978-3-031-46994-7_20. Online publication date: 27-Oct-2023
  • (2022) A Long Command Subsequence Algorithm for Manufacturing Industry Recommendation System with Similarity Connection Technology. International Journal of Mathematical Models and Methods in Applied Sciences 16, pp. 112-118. DOI: 10.46300/9101.2022.16.19. Online publication date: 17-May-2022
  • (2022) A long command subsequence algorithm for manufacturing industry recommendation systems with similarity connection technology. Applied Mathematics and Nonlinear Sciences 8:2, pp. 789-798. DOI: 10.2478/amns.2021.2.00232. Online publication date: 30-Sep-2022
  • (2022) Pivot-based approximate k-NN similarity joins for big high-dimensional data. Information Systems 87:C. DOI: 10.1016/j.is.2019.06.006. Online publication date: 21-Apr-2022
  • (2021) Trend estimation model of students' thought and behavior based on big data. 2021 2nd International Conference on Computers, Information Processing and Advanced Education, pp. 1084-1089. DOI: 10.1145/3456887.3457465. Online publication date: 25-May-2021
  • (2021) Auto-FuzzyJoin. Proceedings of the 2021 International Conference on Management of Data, pp. 1064-1076. DOI: 10.1145/3448016.3452824. Online publication date: 9-Jun-2021
  • (2021) Internal and external memory set containment join. The VLDB Journal. DOI: 10.1007/s00778-020-00644-3. Online publication date: 23-Feb-2021
  • (2020) Optimization for Large-Scale Fuzzy Joins Using Fuzzy Filters in MapReduce. 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1-8. DOI: 10.1109/FUZZ48607.2020.9177610. Online publication date: Jul-2020