DOI: 10.1145/2457317.2457387
research-article

Efficient edit distance based string similarity search using deletion neighborhoods

Published: 18 March 2013

ABSTRACT

This paper reports on the participation of the Special Interest Group in Data (SIGDATA), Indian Institute of Technology, Kanpur, in the String Similarity Workshop at EDBT 2013. We present a novel technique for efficiently processing edit-distance-based string similarity queries. Our technique builds on prior work in the field and introduces new methods to address the issues that work leaves open. We aim for the lowest possible execution time while being liberal with memory consumption. We propose and support the use of deletion neighborhoods for fast edit distance lookups in dictionaries. Our work emphasizes the advantage of deletion neighborhoods over other popular fingerprint-based schemes for similarity search queries. Furthermore, we establish that simple hashing techniques can reduce the large space requirement of a deletion-neighborhood-based fingerprint scheme, making the scheme suitable for practical use. We compare our implementation with state-of-the-art libraries (Flamingo) and report speedups of up to an order of magnitude.
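To make the idea of a deletion neighborhood concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual formulation: the k-deletion neighborhood of a string is every string obtainable by deleting at most k characters, and two strings within edit distance k always share at least one such variant. The function names (`deletion_neighborhood`, `build_index`, `search`, `key`) and the use of CRC32 as the "simple hashing" step are illustrative assumptions; the abstract does not specify which hash the paper uses.

```python
from collections import defaultdict
import zlib


def deletion_neighborhood(s, k):
    """Return every string obtainable from s by deleting at most k characters."""
    out = {s}
    frontier = {s}
    for _ in range(k):
        # Delete one more character from every string produced so far.
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        out |= frontier
    return out


def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance (used for verification)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def key(variant):
    """Assumed hashing step: store a 32-bit checksum instead of the variant string."""
    return zlib.crc32(variant.encode("utf-8"))


def build_index(words, k):
    """Map the hashed form of each deletion variant to the words that produce it."""
    index = defaultdict(set)
    for w in words:
        for v in deletion_neighborhood(w, k):
            index[key(v)].add(w)
    return index


def search(index, query, k):
    """Return all indexed words within edit distance k of query, sorted."""
    candidates = set()
    for v in deletion_neighborhood(query, k):
        candidates |= index.get(key(v), set())
    # Verification discards false positives, including any from hash collisions.
    return sorted(w for w in candidates if edit_distance(query, w) <= k)


if __name__ == "__main__":
    idx = build_index(["similarity", "similarly", "simulate", "string"], 2)
    print(search(idx, "similarty", 2))
```

The sketch trades memory for speed in the way the abstract describes: the index holds one entry per deletion variant, so it grows quickly with k and with string length, while a query needs only hash lookups plus a final verification pass. Replacing each variant string with a fixed-size hash shrinks the keys at the cost of occasional collisions, which the verification step filters out.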


Published in

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops
March 2013
423 pages
ISBN: 9781450315999
DOI: 10.1145/2457317

Copyright © 2013 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

EDBT '13 paper acceptance rate: 7 of 10 submissions, 70%
Overall acceptance rate: 7 of 10 submissions, 70%
