Improved Approximate String Matching Using Compressed Suffix Data Structures

Lam, Tak-Wah; Sung, Wing-Kin; Wong, Swee-Seong

doi:10.1007/s00453-007-9104-8

Published: 01 November 2007

Volume 51, pages 298–314, (2008)

Algorithmica

Abstract

Approximate string matching is the problem of finding occurrences of a given pattern in a text while allowing a bounded number of errors. In this paper we present a space-efficient data structure for the 1-mismatch and 1-difference problems. Given a text T of length n over an alphabet A, we can preprocess T into an \(O(n\sqrt{\log n}\log |A|)\)-bit data structure so that, for any query pattern P of length m, we can find all 1-mismatch (or 1-difference) occurrences of P in \(O(|A|m\log\log n+occ)\) time, where occ is the number of occurrences. This is the fastest known query time for a data structure that uses \(o(n\log^2 n)\) bits of space.
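To make the 1-mismatch problem concrete, here is a minimal Python sketch. It is not the data structure of the paper: a plain sorted suffix array stands in for the compressed suffix structures, so it meets neither the space bound nor the query-time bound stated above. The function names `one_mismatch_brute` and `one_mismatch_indexed` and the `alphabet` parameter are hypothetical, chosen only for illustration. The indexed variant runs one exact search per substituted pattern, which gives a feel for where a factor of |A|m in the query time can come from.

```python
from bisect import bisect_left, bisect_right


def one_mismatch_brute(text: str, pattern: str) -> list[int]:
    """Reference definition: start positions whose length-m window differs
    from the pattern in at most one character (Hamming distance <= 1)."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = sum(1 for a, b in zip(text[i:i + m], pattern) if a != b)
        if mismatches <= 1:
            hits.append(i)
    return hits


def one_mismatch_indexed(text: str, pattern: str, alphabet: str) -> list[int]:
    """Illustrative index-based search (assumption: a naive suffix array,
    not a compressed one). Tries the pattern itself plus every
    single-character substitution, i.e. 1 + (|A| - 1) * m exact searches."""
    m = len(pattern)
    # Naive suffix array: suffix start positions in lexicographic order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Length-m prefixes of the sorted suffixes stay in sorted order,
    # so binary search can be used on them directly.
    keys = [text[i:i + m] for i in sa]

    def exact(p: str) -> list[int]:
        return sa[bisect_left(keys, p):bisect_right(keys, p)]

    hits = set(exact(pattern))
    for j in range(m):
        for c in alphabet:
            if c != pattern[j]:
                hits.update(exact(pattern[:j] + c + pattern[j + 1:]))
    return sorted(hits)
```

For example, `one_mismatch_indexed("banana", "nana", "abn")` returns `[0, 2]`: position 2 is an exact occurrence, and the window `bana` at position 0 differs from the pattern in one character.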

The space of our data structure can be further reduced to \(O(n\log |A|)\) bits, with the query time increasing by a factor of \(\log^{\varepsilon} n\), for \(0<\varepsilon\le 1\). Furthermore, our solution can be generalized to solve the k-mismatch (and the k-difference) problem in \(O(|A|^k m^k(k+\log\log n)+occ)\) and \(O(\log^{\varepsilon} n\,(|A|^k m^k(k+\log\log n)+occ))\) time using an \(O(n\sqrt{\log n}\log |A|)\)-bit and an \(O(n\log |A|)\)-bit indexing data structure, respectively. We assume that the alphabet size |A| is bounded by \(O(2^{\sqrt{\log n}})\) for the \(O(n\sqrt{\log n}\log |A|)\)-bit data structure.
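As a short worked step (derived here from the stated bounds, not quoted from the paper), the alphabet assumption shows why the larger data structure stays within the \(o(n\log^2 n)\)-bit regime mentioned in the abstract:

\[
|A| = O\!\left(2^{\sqrt{\log n}}\right)
\;\Longrightarrow\;
\log |A| = O\!\left(\sqrt{\log n}\right)
\;\Longrightarrow\;
n\sqrt{\log n}\,\log |A| = O(n\log n) = o\!\left(n\log^2 n\right).
\]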

Author information

Authors and Affiliations

School of Computing, National University of Singapore, Singapore, Singapore
Wing-Kin Sung & Swee-Seong Wong
Department of Computer Science, The University of Hong Kong, Hong Kong, Hong Kong
Tak-Wah Lam

Corresponding author

Correspondence to Swee-Seong Wong.

About this article

Cite this article

Lam, TW., Sung, WK. & Wong, SS. Improved Approximate String Matching Using Compressed Suffix Data Structures. Algorithmica 51, 298–314 (2008). https://doi.org/10.1007/s00453-007-9104-8

Received: 09 February 2006
Accepted: 21 August 2006
Published: 01 November 2007
Issue Date: July 2008
DOI: https://doi.org/10.1007/s00453-007-9104-8
