research-article

MapReduce-based similarity join for metric spaces

Authors:

Yasin N. Silva,

Lisa M. Tsosie

Cloud-I '12: Proceedings of the 1st International Workshop on Cloud Intelligence

Article No.: 3, Pages 1 - 8

https://doi.org/10.1145/2347673.2347676

Published: 31 August 2012

Abstract

Cloud-enabled systems have become a crucial component for efficiently processing and analyzing massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed Similarity Joins for cloud systems. This paper focuses on the study, design, and implementation techniques of cloud-based Similarity Joins. We present MRSimJoin, a MapReduce-based algorithm that efficiently solves the Similarity Join problem. The algorithm partitions and distributes the data until the subsets are small enough to be processed in a single node. MRSimJoin is general enough to be used with data that lies in any metric space, so it can be used with multiple data types and distance functions. We present guidelines for implementing the algorithm in Hadoop, an open-source cloud system. The experimental evaluation of MRSimJoin shows that it has very good execution time and scalability properties.
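To make the operation in the abstract concrete, the following is a minimal single-machine sketch, not the authors' MRSimJoin implementation. `naive_join` applies the definition directly (all pairs closer than ε). `partitioned_join` illustrates one round of partition-then-join. The pivot-based split, the window groups, and all function and variable names here are assumptions introduced for illustration, since the abstract does not describe how the partitioning is done. The sketch relies only on the triangle inequality, so any metric distance function can be plugged in.

```python
import itertools
import random
from collections import defaultdict


def naive_join(records, dist, eps):
    """Reference answer: index pairs (i, j), i < j, with dist < eps."""
    return {
        (i, j)
        for (i, a), (j, b) in itertools.combinations(enumerate(records), 2)
        if dist(a, b) < eps
    }


def partitioned_join(records, dist, eps, pivots):
    """One round of a hypothetical pivot-based partition-then-join."""
    base = defaultdict(list)    # home pivot -> [(index, record)]
    window = defaultdict(list)  # (i, j), i < j -> [(home pivot, index, record)]
    for idx, rec in enumerate(records):
        d = [dist(rec, p) for p in pivots]
        home = min(range(len(pivots)), key=d.__getitem__)
        base[home].append((idx, rec))
        for other in range(len(pivots)):
            # By the triangle inequality, rec can only match a record whose
            # home is `other` if it lies this close to the shared boundary.
            if other != home and d[other] - d[home] <= 2 * eps:
                key = (min(home, other), max(home, other))
                window[key].append((home, idx, rec))

    result = set()
    # Pairs whose records share a home pivot are found inside base groups.
    for group in base.values():
        for (i, a), (j, b) in itertools.combinations(group, 2):
            if dist(a, b) < eps:
                result.add((min(i, j), max(i, j)))
    # Pairs with different home pivots are found inside window groups.
    for group in window.values():
        for (ha, i, a), (hb, j, b) in itertools.combinations(group, 2):
            if ha != hb and dist(a, b) < eps:
                result.add((min(i, j), max(i, j)))
    return result


def manhattan(a, b):
    """L1 distance; integer inputs keep the boundary test exact."""
    return sum(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
points = [(rng.randrange(100), rng.randrange(100)) for _ in range(200)]
pivots = rng.sample(points, 4)
assert partitioned_join(points, manhattan, 5, pivots) == naive_join(points, manhattan, 5)
```

In a MapReduce setting, the first loop would roughly correspond to a map phase that emits each record under its base and window keys, and the two join loops to reduce-side work on each group. A group that is still too large for one node could be partitioned again with fresh pivots, which this single-round sketch does not do.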

Index Terms

MapReduce-based similarity join for metric spaces
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Information & Contributors

Information

Published In

Cloud-I '12: Proceedings of the 1st International Workshop on Cloud Intelligence

August 2012

59 pages

ISBN:9781450315968

DOI:10.1145/2347673

Conference Chairs:
Jérôme Darmont,
Torben Bach Pedersen

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

Cloud-I '12

Cloud-I '12: 1st International Workshop on Cloud Intelligence

August 31, 2012

Istanbul, Turkey

Acceptance Rates

Cloud-I '12 Paper Acceptance Rate 8 of 15 submissions, 53%;

Overall Acceptance Rate 12 of 23 submissions, 52%

Bibliometrics & Citations

Bibliometrics

Total Citations: 18
Total Downloads: 277
Downloads (last 12 months): 0
Downloads (last 6 weeks): 0

Reflects downloads up to 07 Mar 2025

Citations

Cited By

Uhartegaray R, D'Orazio L, Damigos M, Kalogeros E (2023). Scalable Computation of Fuzzy Joins Over Large Collections of JSON Data. 2023 IEEE International Conference on Fuzzy Systems (FUZZ), pp. 1-6. Online publication date: 13-Aug-2023.
https://doi.org/10.1109/FUZZ52849.2023.10309759
Silva Y, Martinez J, Castro Cea P, Razente H, Nardini Barioni M (2023). Diversity Similarity Join for Big Data. Similarity Search and Applications, pp. 238-252. Online publication date: 27-Oct-2023.
https://doi.org/10.1007/978-3-031-46994-7_20
Souza V, Carvalho L, de Oliveira D, Bedo M, Santos L (2023). Adding Result Diversification to kNN-Based Joins in a Map-Reduce Framework. Database and Expert Systems Applications, pp. 68-83. Online publication date: 18-Aug-2023.
https://doi.org/10.1007/978-3-031-39847-6_5
Čech P, Lokoč J, Silva Y (2022). Pivot-based approximate k-NN similarity joins for big high-dimensional data. Information Systems, 87:C. Online publication date: 21-Apr-2022.
https://dl.acm.org/doi/10.1016/j.is.2019.06.006
Tran T, Phan T, Laurent A, D'Orazio L (2020). Optimization for Large-Scale Fuzzy Joins Using Fuzzy Filters in MapReduce. 2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1-8. Online publication date: Jul-2020.
https://doi.org/10.1109/FUZZ48607.2020.9177610
Meena K, Tayal D, Castillo O, Jain A (2020). Handling data-skewness in character based string similarity join using Hadoop. Applied Computing and Informatics, 18:1/2, pp. 22-44. Online publication date: 4-Aug-2020.
https://doi.org/10.1016/j.aci.2018.11.001
Al-Badarneh A (2019). Join Algorithms under Apache Spark. Proceedings of the 2019 5th International Conference on Computer and Technology Applications, pp. 56-62. Online publication date: 16-Apr-2019.
https://dl.acm.org/doi/10.1145/3323933.3324094
Barhoush M, AlSobeh A, Rawashdeh A (2019). A Survey on Parallel Join Algorithms Using MapReduce on Hadoop. 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), pp. 381-388. Online publication date: Apr-2019.
https://doi.org/10.1109/JEEIT.2019.8717427
Metwally A, Huang C (2019). Scalable Similarity Joins of Tokenized Strings. 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1766-1777. Online publication date: Apr-2019.
https://doi.org/10.1109/ICDE.2019.00193
Tran T, Phan T, Laurent A, D'Orazio L (2018). Improving Hamming distance-based fuzzy join in MapReduce using Bloom Filters. 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1-7. Online publication date: Jul-2018.
https://doi.org/10.1109/FUZZ-IEEE.2018.8491658
