ABSTRACT
String similarity search is the problem of finding strings in a given database that are similar to a query string. The problem has many applications in computer engineering, such as spelling correction, spam filtering, and information retrieval. Among the various solutions, we focus on the distance-space transform, which maps strings into k-dimensional vectors whose components are the distances from preselected reference objects (called pivots), so that well-known multidimensional spatial data structures such as kD-trees and R*-trees can be used for indexing. In this paper, we generalize the distance-space transform into a filtering framework. Based on this framework, we also present an alignment-space transform as an extension of the distance-space transform. Through experiments, we evaluate the search performance of the proposed method across a variety of search range parameters and pivot selection strategies.
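The distance-space transform described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes unit-cost edit distance as the string metric, uses hand-picked pivots, and replaces the kD-tree or R*-tree index with a linear scan over the pivot vectors. The function names (`edit_distance`, `to_vector`, `range_search`) are hypothetical. The filter relies on the triangle inequality: for any pivot p, |d(q, p) - d(s, p)| <= d(q, s), so a string whose vector differs from the query's vector by more than the search radius in any component cannot be a match.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def to_vector(s: str, pivots: Sequence[str]) -> Tuple[int, ...]:
    """Map a string to its vector of distances from the pivots."""
    return tuple(edit_distance(s, p) for p in pivots)


def range_search(query: str, radius: int,
                 database: Sequence[str],
                 pivots: Sequence[str]) -> List[str]:
    """Return all database strings within `radius` edits of `query`.

    Assumption: a linear scan over the vectors stands in for a spatial
    index; a real system would precompute and index the vectors offline.
    """
    vectors = [to_vector(s, pivots) for s in database]
    q = to_vector(query, pivots)
    results = []
    for s, v in zip(database, vectors):
        # Filter step: the largest per-pivot difference is a lower bound
        # on the true distance, by the triangle inequality.
        lower_bound = max((abs(qi - vi) for qi, vi in zip(q, v)), default=0)
        if lower_bound > radius:
            continue
        # Verification step: compute the exact distance for survivors.
        if edit_distance(query, s) <= radius:
            results.append(s)
    return results


if __name__ == "__main__":
    db = ["kitten", "sitting", "mitten", "banana"]
    print(range_search("bitten", 1, db, pivots=["kitten", "banana"]))
```

Because the filter only discards strings that provably lie outside the search range, the result is the same as a brute-force scan; the pivots affect only how many exact distance computations are avoided.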
Index Terms
- Flexible and efficient string similarity search with alignment-space transform