skip to main content
10.1145/2457317.2457388acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

Approximate string matching by position restricted alignment

Published:18 March 2013Publication History

ABSTRACT

Given a collection of strings, goal of the approximate string matching is to efficiently find the strings in the collection that are similar to a query string. In this paper, we focus on edit distance as measure to quantify the similarity between two strings. Existing q-gram based methods to address this problem use inverted indexes to index the q-grams of given string collection. These methods begin by generating the q-grams of query string (disjoint or overlapping) and then merge the inverted lists of these q-grams. Several filtering techniques have been proposed so as to segment inverted lists to relatively shorter lists thus reducing the merging cost. We use a filtering technique which we call as "position restricted alignment" that combines well known length filtering and position filtering to provide more aggressive pruning. We then provide an indexing scheme that integrates the inverted lists storage with the proposed filter thus enabling us to auto-filter the inverted lists. We evaluate the effectiveness of the proposed approach by thorough experimentation.

References

  1. A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD Conference, pages 313--324, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don't cares. In STOC, pages 91--100, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Deng, G. Li, and J. Feng. Top-k string similarity search with edit-distance constraints. In ICDE, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Feng, J. Wang, and G. Li. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4):437--461, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In Proceedings of Symposium on Discrete Algorithms, pages 841--850, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. T. Kahveci and A. K. Singh. Efficient index structures for string databases. In VLDB, pages 351--360, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD Conference, pages 802--803, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303--314, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Muthukrishnan. Efficient Algorithms for Document Retrieval Problems. In Proceedings of Symposium on Discrete Algorithms, pages 657--666, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1), 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. E. Ohlebusch, J. Fischer, and S. Gog. Cst++. In SPIRE, pages 322--333, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets. In Proceedings of Symposium on Discrete Algorithms, pages 233--242, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. K. Ramasamy, J. M. Patel, J. F. Naughton, and R. Kaushik. Set containment joins: The good, the bad and the ugly. In VLDB, pages 351--362, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743--754, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of Symposium on Switching and Automata Theory, pages 1--11, 1973. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933--944, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Z. Yang, J. Yu, and M. Kitsuregawa. Fast algorithms for top-k approximate string matching. In AAAI, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  23. Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD Conference, pages 915--926, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Approximate string matching by position restricted alignment

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops
        March 2013
        423 pages
        ISBN:9781450315999
        DOI:10.1145/2457317

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 March 2013

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        EDBT '13 Paper Acceptance Rate7of10submissions,70%Overall Acceptance Rate7of10submissions,70%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader