research-article

Approximate string matching by position restricted alignment

Authors:
Manish Patil, Xuanting Cai, Sharma V. Thankachan, Rahul Shah, Seung-Jong Park, David Foltz
Louisiana State University

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops, March 2013, Pages 384–391
https://doi.org/10.1145/2457317.2457388

Published: 18 March 2013

ABSTRACT

Given a collection of strings, the goal of approximate string matching is to efficiently find the strings in the collection that are similar to a query string. In this paper, we focus on edit distance as the measure of similarity between two strings. Existing q-gram based methods for this problem use inverted indexes to index the q-grams of the given string collection. These methods begin by generating the q-grams of the query string (disjoint or overlapping) and then merge the inverted lists of these q-grams. Several filtering techniques have been proposed to segment the inverted lists into relatively shorter lists, thus reducing the merging cost. We use a filtering technique, which we call "position restricted alignment", that combines the well-known length filtering and position filtering to provide more aggressive pruning. We then provide an indexing scheme that integrates the storage of the inverted lists with the proposed filter, enabling us to auto-filter the inverted lists. We evaluate the effectiveness of the proposed approach through thorough experimentation.
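
To make the q-gram pipeline described above concrete, the following is a minimal Python sketch of a generic inverted-index search with the classical length and position filters applied to each posting. It is not the paper's "position restricted alignment" filter or its indexing scheme, whose details are not given in the abstract. All names (`positional_qgrams`, `build_index`, `search`) are hypothetical, and the count threshold is the standard, deliberately conservative bound that each edit operation destroys at most q of the query's q-grams.

```python
from collections import defaultdict


def positional_qgrams(s, q):
    """Return the overlapping q-grams of s paired with their start positions."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Inverted index: q-gram -> postings of (string id, string length, position)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram, pos in positional_qgrams(s, q):
            index[gram].append((sid, len(s), pos))
    return index


def edit_distance(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(strings, index, query, q, k):
    """Return sorted ids of strings within edit distance k of query."""
    # Assumed count bound: at least this many query q-grams survive k edits.
    threshold = len(query) - q + 1 - k * q
    if threshold <= 0:
        # The count filter cannot prune; fall back to length filter + verification.
        candidates = [sid for sid, s in enumerate(strings)
                      if abs(len(s) - len(query)) <= k]
    else:
        matched = defaultdict(set)  # string id -> query positions with a match
        for gram, qpos in positional_qgrams(query, q):
            for sid, slen, spos in index.get(gram, ()):
                if abs(slen - len(query)) > k:
                    continue  # length filter: lengths differ by more than k
                if abs(spos - qpos) > k:
                    continue  # position filter: q-gram shifted by more than k
                matched[sid].add(qpos)
        candidates = [sid for sid, hits in matched.items() if len(hits) >= threshold]
    # Filters only prune; every surviving candidate is verified exactly.
    return sorted(sid for sid in candidates
                  if edit_distance(strings[sid], query) <= k)


if __name__ == "__main__":
    words = ["kitten", "sitting", "mitten", "kitchen", "bitten", "written"]
    idx = build_index(words, 2)
    print([words[i] for i in search(words, idx, "kitten", 2, 1)])
```

In this sketch the filters are checked per posting while the lists are scanned. The abstract's indexing scheme instead aims to organise the inverted lists so that postings failing the filter are never read in the first place.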


Index Terms

Approximate string matching by position restricted alignment
1. Information systems
  1. Information systems applications
2. Theory of computation
  1. Design and analysis of algorithms

Published in
EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops
March 2013
423 pages
ISBN: 9781450315999
DOI: 10.1145/2457317
General Chair:
Giovanna Guerrini
Università di Genova, Italy
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
EDBT '13 paper acceptance rate: 7 of 10 submissions (70%)
Overall acceptance rate: 7 of 10 submissions (70%)