DOI: 10.1145/2448556.2448632
research-article

Flexible and efficient string similarity search with alignment-space transform

Published: 17 January 2013

ABSTRACT

String similarity search is the problem of finding similar strings in a given database. This problem has a number of applications throughout computer engineering, such as spelling correction, spam filtering, and information retrieval. Among the various solutions to this problem, we focus on the distance-space transform, which maps strings into k-dimensional vectors whose components are the distances from preselected reference objects (called pivots), so that well-known multidimensional spatial data structures such as kD-trees and R*-trees can be used for indexing. In this paper, we further develop the distance-space transform into a more general filtering framework. Based on this framework, we also present an alignment-space transform as an extension of the distance-space transform. Through experiments, we demonstrate the search performance of our proposed method with respect to a variety of search range parameters and pivot selection strategies.
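The distance-space transform described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the Levenshtein edit distance as the string metric, replaces the kD-tree or R*-tree with a plain linear scan over the pivot vectors, and does not attempt the alignment-space extension, which the abstract does not specify. The names `edit_distance`, `to_vector`, `build_index`, and `range_search` are hypothetical. The filter relies on the triangle inequality: if d(q, s) <= r, then |d(q, p) - d(s, p)| <= r for every pivot p, so any string that violates this bound for some pivot can be discarded without computing d(q, s).

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def to_vector(s: str, pivots: Sequence[str]) -> Tuple[int, ...]:
    """Map a string to the vector of its distances from each pivot."""
    return tuple(edit_distance(s, p) for p in pivots)


def build_index(db: Sequence[str],
                pivots: Sequence[str]) -> List[Tuple[str, Tuple[int, ...]]]:
    """Precompute the pivot vector of every database string.

    Assumption: a flat list stands in for a spatial index such as a kD-tree.
    """
    return [(s, to_vector(s, pivots)) for s in db]


def range_search(query: str, radius: int,
                 index: List[Tuple[str, Tuple[int, ...]]],
                 pivots: Sequence[str]) -> Tuple[List[str], int]:
    """Return strings within `radius` of `query`, plus the candidate count."""
    qv = to_vector(query, pivots)
    # Filter step: keep only strings whose pivot vector lies inside the
    # axis-aligned box of half-width `radius` around the query vector.
    candidates = [s for s, v in index
                  if all(abs(x - y) <= radius for x, y in zip(qv, v))]
    # Verification step: compute the true distance only for the candidates.
    results = [s for s in candidates if edit_distance(query, s) <= radius]
    return results, len(candidates)


db = ["kitten", "sitting", "mitten", "bitten", "apple", "banana"]
pivots = ["kitten", "banana"]  # arbitrary choice, for illustration only
index = build_index(db, pivots)
results, n_candidates = range_search("kitten", 1, index, pivots)
print(results, n_candidates)
```

The filter never discards a true match, so the result equals that of a brute-force scan; the benefit is that the expensive distance computation runs only on the surviving candidates. How many candidates survive depends on the number and choice of pivots, which is why pivot selection strategies matter for search performance.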


Published in: ICUIMC '13: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, January 2013, 772 pages
ISBN: 9781450319584
DOI: 10.1145/2448556

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 251 of 941 submissions, 27%