research-article

Can we beat the prefix filtering?: an adaptive framework for similarity join and search

Authors:

Jianhua Feng

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Pages 85 - 96

https://doi.org/10.1145/2213836.2213847

Published: 20 May 2012

Abstract

Similarity join and similarity search are two important operations in data cleaning and have attracted much attention recently. Existing methods to support similarity join usually adopt a prefix-filtering-based framework: they select a prefix of each object and prune object pairs whose prefixes do not overlap. We observe that prefix length has a significant effect on performance: different prefix lengths lead to significantly different performance, and prefix filtering does not always achieve high performance. To address this problem, we propose an adaptive framework to support similarity join. We propose a cost model to judiciously select an appropriate prefix for each object, and we devise effective indexes to select prefixes efficiently. We also extend our method to support similarity search. Experimental results show that our framework beats the prefix-filtering-based framework and achieves high efficiency.
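
The prefix-filtering idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm, index, or cost model. The overlap threshold `c`, the parameter `ell`, the function names, and the toy records are all assumptions made for illustration. The sketch relies on the following property: under one global token order, two sets with at least `c` common tokens must share at least `ell` tokens among their first `|s| - c + ell` tokens, for any `1 <= ell <= c`. Setting `ell = 1` gives the usual prefix filter; larger values give longer prefixes.

```python
def global_order(records):
    """Rank tokens by ascending document frequency (rare tokens first)."""
    freq = {}
    for rec in records:
        for tok in set(rec):
            freq[tok] = freq.get(tok, 0) + 1
    ordered = sorted(freq, key=lambda t: (freq[t], t))  # ties broken by token
    return {tok: pos for pos, tok in enumerate(ordered)}


def candidates(records, c, ell=1):
    """Return pairs (i, j), i < j, whose prefixes share at least `ell` tokens.

    Each record keeps its first |s| - c + ell tokens in the global order.
    `c` (overlap threshold) and `ell` are illustrative parameters.
    """
    if not 1 <= ell <= c:
        raise ValueError("ell must satisfy 1 <= ell <= c")
    rank = global_order(records)
    index = {}   # token -> ids of earlier records whose prefix contains it
    shared = {}  # (i, j) -> number of tokens common to both prefixes
    for j, rec in enumerate(records):
        toks = sorted(set(rec), key=rank.__getitem__)
        if len(toks) < c:
            continue  # too small to share c tokens with any set
        for tok in toks[:len(toks) - c + ell]:
            for i in index.get(tok, ()):
                shared[(i, j)] = shared.get((i, j), 0) + 1
            index.setdefault(tok, []).append(j)
    return {pair for pair, n in shared.items() if n >= ell}


def verify(records, pairs, c):
    """Keep only the candidate pairs whose true overlap is at least c."""
    return {(i, j) for i, j in pairs
            if len(set(records[i]) & set(records[j])) >= c}


if __name__ == "__main__":
    # Hypothetical toy data: each record is a set of single-letter tokens.
    records = [list("abcd"), list("abce"), list("bcdf"),
               list("xyza"), list("xyzw"), list("eaxy")]
    for ell in (1, 2, 3):
        cands = candidates(records, c=3, ell=ell)
        print(ell, len(cands), sorted(verify(records, cands, c=3)))
```

On this toy input every prefix length returns the same verified pairs, but the longer prefix prunes a candidate pair that the 1-prefix lets through. The trade-off is that longer prefixes put more tokens into the index; choosing the length per object is the role the abstract assigns to the cost model, which this sketch does not attempt to model.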

Index Terms

Can we beat the prefix filtering?: an adaptive framework for similarity join and search
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Information retrieval query processing

Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

May 2012

886 pages

ISBN: 9781450312479

DOI: 10.1145/2213836

General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)

Program Chair: Luis Gravano (Columbia University)

Publications Chair: Ariel Fuxman (Microsoft Research)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '12: International Conference on Management of Data

Sponsor: SIGMOD

May 20 - 24, 2012

Scottsdale, Arizona, USA

Acceptance Rates

SIGMOD '12 paper acceptance rate: 48 of 289 submissions (17%)

Overall acceptance rate: 785 of 4,003 submissions (20%)

Article Metrics

Total citations: 194

Total downloads: 1,120

Downloads (last 12 months): 36

Downloads (last 6 weeks): 2

Reflects downloads up to 19 Feb 2025
