DOI: 10.1145/2457317.2457382

Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Published: 18 March 2013

Abstract

The quantity of data in real-world applications is growing significantly, yet data quality remains a major problem. Similarity search and similarity join are two important operations for addressing poor data quality. Although many similarity search and join algorithms have been proposed, they do not exploit the capabilities of modern multi-core processors. New parallel algorithms are therefore needed so that multi-core processors can meet the high performance requirements of similarity search and join on big data. To this end, we propose parallel algorithms that support efficient similarity search and join with edit-distance constraints. We adopt the partition-based framework and extend it to support parallel similarity search and join on multi-core processors. We also develop two novel pruning techniques. We have implemented our algorithms, and experimental results on two real datasets show that they achieve high performance and good speedup.
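
The abstract does not describe the framework in detail, so the following is only a minimal sketch of the general partition-based idea, not the paper's algorithms or its two pruning techniques. It relies on the pigeonhole principle: if a string is split into tau + 1 segments and is within edit distance tau of the query, at least one segment must occur unchanged as a substring of the query. The names `segments`, `is_candidate`, and `search`, the even partitioning scheme, and the use of a thread pool are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def edit_distance(a, b):
    # Standard dynamic-programming (Levenshtein) distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def segments(s, tau):
    # Split s into tau + 1 contiguous segments of near-equal length
    # (hypothetical scheme: any remainder goes to the last segments).
    n = tau + 1
    base, extra = divmod(len(s), n)
    out, pos = [], 0
    for i in range(n):
        length = base + (1 if i >= n - extra else 0)
        out.append(s[pos:pos + length])
        pos += length
    return out


def is_candidate(query, s, tau):
    # Length filter: strings whose lengths differ by more than tau cannot match.
    if abs(len(query) - len(s)) > tau:
        return False
    # Pigeonhole filter: at most tau edits can touch at most tau segments,
    # so one of the tau + 1 segments must appear verbatim in the query.
    return any(seg in query for seg in segments(s, tau))


def search(query, strings, tau, workers=4):
    # Filter, then verify each surviving candidate; the strings are
    # distributed over a pool of worker threads.
    def check(s):
        return is_candidate(query, s, tau) and edit_distance(query, s) <= tau

    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(check, strings))
    return [s for s, ok in zip(strings, flags) if ok]
```

For example, `search("vldb", ["vldb", "pvldb", "icde", "vld", "sigmod"], 1)` keeps only the strings within edit distance 1 of the query. The thread pool shows the structure of distributing independent checks across workers; in CPython the global interpreter lock limits the speedup for pure-Python work, so a process pool or a compiled implementation would be needed to measure real multi-core gains.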

      Published In

      EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops
      March 2013
      423 pages
      ISBN:9781450315999
      DOI:10.1145/2457317

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. content filter
      2. parallel algorithms
      3. similarity join
      4. similarity search

      Qualifiers

      • Research-article

      Conference

      EDBT/ICDT '13

      Acceptance Rates

      EDBT '13 Paper Acceptance Rate 7 of 10 submissions, 70%;
      Overall Acceptance Rate 7 of 10 submissions, 70%

      Cited By

      • (2022) Toward Efficient Similarity Search under Edit Distance on Hybrid Architectures. Information 13:10 (452). DOI: 10.3390/info13100452. Online publication date: 26-Sep-2022
      • (2019) Comparison of the Data Matching Performances of String Similarity Algorithms in Big Data (original Turkish title: Büyük Veride Metin Benzerlik Algoritmalarının Veri Eşleme Performanslarının Karşılaştırılması). Mühendislik Bilimleri ve Tasarım Dergisi 7:3 (608-618). DOI: 10.21923/jesd.467036. Online publication date: 15-Sep-2019
      • (2018) Set Similarity Search for Skewed Data. Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (63-74). DOI: 10.1145/3196959.3196985. Online publication date: 27-May-2018
      • (2018) Submodularity of Distributed Join Computation. Proceedings of the 2018 International Conference on Management of Data (1237-1252). DOI: 10.1145/3183713.3183728. Online publication date: 27-May-2018
      • (2018) Approximate Set Similarity Join Using Many-Core Processors. Database and Expert Systems Applications (214-222). DOI: 10.1007/978-3-319-98812-2_18. Online publication date: 9-Aug-2018
      • (2017) Human-in-the-loop data integration. Proceedings of the VLDB Endowment 10:12 (2006-2017). DOI: 10.14778/3137765.3137833. Online publication date: 1-Aug-2017
      • (2017) Efficient string similarity join in multi-core and distributed systems. PLOS ONE 12:3 (e0172526). DOI: 10.1371/journal.pone.0172526. Online publication date: 9-Mar-2017
      • (2017) LS-Join: Local Similarity Join on String Collections. IEEE Transactions on Knowledge and Data Engineering 29:9 (1928-1942). DOI: 10.1109/TKDE.2017.2687460. Online publication date: 1-Sep-2017
      • (2016) On the Complexity of Inner Product Similarity Join. Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (151-164). DOI: 10.1145/2902251.2902285. Online publication date: 15-Jun-2016
      • (2016) String similarity search and join. Frontiers of Computer Science: Selected Publications from Chinese Universities 10:3 (399-417). DOI: 10.1007/s11704-015-5900-5. Online publication date: 1-Jun-2016
