research-article

Efficient edit distance based string similarity search using deletion neighborhoods

Authors:
Shashwat Mishra, Tejas Gandhi, Akhil Arora, Arnab Bhattacharya

Indian Institute of Technology, Kanpur, India

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops, March 2013, Pages 375–383
https://doi.org/10.1145/2457317.2457387

Published: 18 March 2013

ABSTRACT

This paper reports on the participation of the Special Interest Group in Data (SIGDATA), Indian Institute of Technology, Kanpur, in the String Similarity Workshop at EDBT 2013. We present a novel technique for efficiently processing edit-distance-based string similarity queries. Our technique builds on prior work in the field and introduces new methods to address the issues therein. We focus on minimizing execution time while being rather liberal with memory consumption. We propose and support the use of deletion neighborhoods for fast edit distance lookups in dictionaries. Our work emphasizes the power of deletion neighborhoods over other popular fingerprint-based schemes for similarity search queries. Furthermore, we establish that the large space requirement of a deletion-neighborhood-based fingerprint scheme can be reduced using simple hashing techniques, making the scheme suitable for practical application. We compare our implementation with state-of-the-art libraries (Flamingo) and report speedups of up to an order of magnitude.
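To make the idea of a deletion neighborhood concrete, the following is a minimal Python sketch, not the authors' implementation. It relies on a standard property: if two strings are within edit distance k, then some string reachable from each by deleting at most k characters is common to both. The function names (`deletion_neighborhood`, `build_index`, `search`), the use of `crc32` as the "simple hashing" step, and the full dynamic-programming verification are all assumptions made for illustration; the abstract does not specify the paper's actual data structures or hash function.

```python
from collections import defaultdict
from zlib import crc32


def deletion_neighborhood(s, k):
    """Return every string obtainable from s by deleting at most k characters."""
    result = {s}
    frontier = {s}
    for _ in range(k):
        # Delete one more character from every string produced so far.
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        result |= frontier
    return result


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance (used for verification)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def _key(variant):
    # Assumption: store a 32-bit hash instead of the variant string to save
    # space. Collisions only add false candidates, which verification removes.
    return crc32(variant.encode("utf-8"))


def build_index(words, k):
    """Map the hash of each deletion variant to the dictionary words producing it."""
    index = defaultdict(set)
    for w in words:
        for v in deletion_neighborhood(w, k):
            index[_key(v)].add(w)
    return index


def search(index, query, k):
    """Return all indexed words within edit distance k of query, sorted."""
    candidates = set()
    for v in deletion_neighborhood(query, k):
        candidates |= index.get(_key(v), set())
    return sorted(w for w in candidates if edit_distance(query, w) <= k)
```

In this sketch, an index built with k = 1 over the words kitten, sitten, sitting, mitten, and bitter answers the query "kitten" with kitten, mitten, and sitten. The index stores one entry per deletion variant of every word, which illustrates the memory-for-speed trade-off the abstract describes: lookups touch only the query's own variants, while index size grows quickly with k and word length.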

Published in
EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops
March 2013
423 pages
ISBN:9781450315999
DOI:10.1145/2457317
General Chair:
Giovanna Guerrini
Università di Genova, Italy
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
deletion neighborhood
edit distance
string similarity
Acceptance Rates
EDBT '13 Paper Acceptance Rate: 7 of 10 submissions, 70%
Overall Acceptance Rate: 7 of 10 submissions, 70%