ABSTRACT
This paper serves as a report for the participation of Special Interest Group In Data (SIGDATA), Indian Institute of Technology, Kanpur in the String Similarity Workshop, EDBT, 2013. We present a novel technique to efficiently process edit distance based string similarity queries. Our technique draws upon some previously conducted works in the field and introduces new methods to tackle the issues therein. We focus on achieving minimum possible execution time while being rather liberal with memory consumption. We propose and support the use of deletion neighborhoods for fast edit distance lookups in dictionaries. Our work emphasizes the power of deletion neighborhoods over other popular finger print based schemes for similarity search queries. Furthermore, we establish that it is possible to reduce the large space requirement of a deletion neighborhood based finger print scheme using simple hashing techniques, thereby making the scheme suitable for practical application. We compare our implementation with the state of the art libraries (Flamingo) and report speed ups of up to an order of magnitude.
- T. Bocek, E. Hunt, and B. Stiller. Fast similarity search in large dictionaries. 2007.Google Scholar
- D. Deng, G. Li, and J. Feng. An efficient trie-based method for approximate entity extraction with edit-distance constraints. In ICDE, pages 762--773, 2012. Google ScholarDigital Library
- C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008. Google ScholarDigital Library
- G. Li, D. Deng, and J. Feng. Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In SIGMOD Conference, pages 529--540, 2011. Google ScholarDigital Library
- G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011. Google ScholarDigital Library
- E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1):132--137, 1985.Google ScholarCross Ref
- J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85--96, 2012. Google ScholarDigital Library
- W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, pages 759--770, 2009. Google ScholarDigital Library
- Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD Conference, pages 915--926, 2010. Google ScholarDigital Library
Index Terms
- Efficient edit distance based string similarity search using deletion neighborhoods
Recommendations
MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance
KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningWe study a fundamental problem in data analytics: similarity search under edit distance (or, edit similarity search for short). In this problem we try to build an index on a set of n strings S = s1, ..., sn, with the goal of answering the following two ...
Weighted Edit Distance Computation: Strings, Trees, and Dyck
STOC 2023: Proceedings of the 55th Annual ACM Symposium on Theory of ComputingGiven two strings of length n over alphabet Σ, and an upper bound k on their edit distance, the algorithm of Myers (Algorithmica’86) and Landau and Vishkin (JCSS’88) from almost forty years back computes the unweighted string edit distance in O(n+k2) ...
Learning String-Edit Distance
In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: The minimum number of insertions, deletions, and substitutions required to transform one string into the ...
Comments