research-article

Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Authors:

Jianhua Feng

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops

Pages 341 - 348

https://doi.org/10.1145/2457317.2457382

Published: 18 March 2013

Abstract

The volume of data in real-world applications is growing rapidly, while data quality remains a major problem. Similarity search and similarity join are two important operations for coping with poor data quality. Although many similarity search and join algorithms have been proposed, they do not exploit modern multi-core processors. New parallel algorithms are therefore needed so that similarity search and join on big data can meet high performance requirements on multi-core hardware. To this end, we propose parallel algorithms that support efficient similarity search and join with edit-distance constraints. We adopt the partition-based framework and extend it to support parallel similarity search and join on multi-core processors. We also develop two novel pruning techniques. We have implemented our algorithms, and experimental results on two real datasets show that they achieve high performance and good speedup.
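The abstract does not spell out the partition-based framework. As a rough illustration only, the sketch below assumes the usual pigeonhole idea behind partition-based filtering: if a string is split into τ + 1 segments, τ edits can touch at most τ of them, so any string within edit distance τ must contain at least one segment unchanged as a substring. The function names (`partition`, `is_candidate`, `search_chunk`, `similarity_search`) and the `workers` parameter are hypothetical, and the sketch includes neither the index structures nor the two pruning techniques of the paper. Threads are used only to show how the work can be split; in CPython they would not be expected to give real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def partition(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    segments, start = [], 0
    for i in range(k):
        # The last `extra` segments are one character longer.
        length = base + (1 if i >= k - extra else 0)
        segments.append(s[start:start + length])
        start += length
    return segments


def is_candidate(query, s, tau):
    """Pigeonhole filter: keep s only if one of its segments occurs in query."""
    if abs(len(query) - len(s)) > tau:
        return False  # the length difference alone already exceeds tau
    return any(seg in query for seg in partition(s, tau))


def search_chunk(query, chunk, tau):
    """Filter, then verify, every string in one chunk of the data."""
    return [s for s in chunk
            if is_candidate(query, s, tau) and edit_distance(query, s) <= tau]


def similarity_search(query, strings, tau, workers=4):
    """Split the data into `workers` chunks and search them concurrently."""
    chunks = [strings[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: search_chunk(query, c, tau), chunks)
        return sorted(r for part in parts for r in part)


data = ["vldb", "pvldb", "sigmod", "icde", "vldbj", "edbt"]
print(similarity_search("vldb", data, tau=1))
```

The filter is conservative: it can admit false positives, which the final `edit_distance` check removes, but it never discards a true match. A similarity join could reuse the same pieces by running the search once per string of the other collection.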

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Information retrieval query processing

Information

Published In

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops

March 2013

423 pages

ISBN:9781450315999

DOI:10.1145/2457317

General Chair:
Giovanna Guerrini
Università di Genova, Italy

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

EDBT/ICDT '13

EDBT/ICDT '13: Joint 2013 EDBT/ICDT Conferences

March 18 - 22, 2013

Genoa, Italy

Acceptance Rates

EDBT '13 Paper Acceptance Rate 7 of 10 submissions, 70%;

Overall Acceptance Rate 7 of 10 submissions, 70%
