
Improved Approximate String Matching Using Compressed Suffix Data Structures

Algorithmica

Abstract

Approximate string matching is the problem of finding occurrences of a given pattern in a text while allowing a bounded number of errors. In this paper we present a space-efficient data structure for the 1-mismatch and 1-difference problems. Given a text T of length n over an alphabet A, we can preprocess T into an \(O(n\sqrt{\log n}\log |A|)\)-bit data structure so that, for any query pattern P of length m, all 1-mismatch (or 1-difference) occurrences of P can be found in \(O(|A| m \log\log n + occ)\) time, where occ is the number of occurrences. This is the fastest known query time for a data structure occupying \(o(n\log^2 n)\) bits.
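To make the two problem statements concrete, the following is a minimal brute-force sketch in Python. It is not the data structure of the paper and uses no index at all; it only spells out what a 1-mismatch occurrence (Hamming distance at most 1) and a 1-difference occurrence (edit distance at most 1) are. The function names and the convention of reporting 1-difference occurrences as (start, length) pairs are assumptions made for this illustration.

```python
def within_one_mismatch(a, b):
    """True if a and b have equal length and differ in at most one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= 1


def within_one_edit(a, b):
    """True if a can be turned into b by at most one insertion,
    deletion, or substitution of a single character."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution at i
    return a[i:] == b[i + 1:]  # one extra character in b at i


def one_mismatch_occurrences(text, pattern):
    """Start positions of windows within Hamming distance 1 of pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if within_one_mismatch(text[i:i + m], pattern)]


def one_difference_occurrences(text, pattern):
    """(start, length) pairs of substrings within edit distance 1 of pattern."""
    m = len(pattern)
    hits = []
    for length in (m - 1, m, m + 1):
        if length < 1:
            continue
        for i in range(len(text) - length + 1):
            if within_one_edit(text[i:i + length], pattern):
                hits.append((i, length))
    return sorted(hits)
```

This scan takes time proportional to n·m per query, which is the cost an index such as the one described in the abstract is designed to avoid.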

The space of our data structure can be further reduced to \(O(n\log |A|)\) bits, with the query time increasing by a factor of \(\log^{\epsilon} n\), for \(0<\epsilon\le 1\). Furthermore, our solution generalizes to the k-mismatch (and the k-difference) problem, which it solves in \(O(|A|^k m^k (k+\log\log n)+occ)\) and \(O(\log^{\epsilon} n\,(|A|^k m^k (k+\log\log n)+occ))\) time using an \(O(n\sqrt{\log n}\log |A|)\)-bit and an \(O(n\log |A|)\)-bit indexing data structure, respectively. We assume that the alphabet size |A| is bounded by \(O(2^{\sqrt{\log n}})\) for the \(O(n\sqrt{\log n}\log |A|)\)-bit data structure.
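One way to build intuition for a factor that grows with both the alphabet size and the pattern length raised to the power k is to enumerate every pattern variant with at most k substituted characters and look each one up exactly. The sketch below does this for the k-mismatch case with a plain, uncompressed suffix array. It is an assumption-laden illustration, not the algorithm or the compressed structures of the paper, and all function names are hypothetical.

```python
from itertools import combinations, product


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (fine for small texts)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def exact_occurrences(text, sa, query):
    """Start positions of exact occurrences of query, via two binary searches."""
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sa[first:lo]


def k_mismatch_occurrences(text, pattern, alphabet, k):
    """Return (sorted start positions, number of candidate patterns tried)."""
    sa = build_suffix_array(text)
    m = len(pattern)
    found, candidates = set(), 0
    for j in range(min(k, m) + 1):  # exactly j substituted positions
        for positions in combinations(range(m), j):
            choices = [[c for c in alphabet if c != pattern[p]]
                       for p in positions]
            for letters in product(*choices):
                variant = list(pattern)
                for p, c in zip(positions, letters):
                    variant[p] = c
                candidates += 1
                found.update(exact_occurrences(text, sa, "".join(variant)))
    return sorted(found), candidates
```

For a pattern of length m this tries \(\sum_{j\le k}\binom{m}{j}(|A|-1)^j\) variants, which is at most \(|A|^k m^k\) for k ≥ 1. The per-variant lookup here costs O(m log n) string comparisons, so the sketch does not achieve the query bounds quoted above; it only shows where a candidate count of that shape can come from.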



Author information

Corresponding author

Correspondence to Swee-Seong Wong.

About this article

Cite this article

Lam, TW., Sung, WK. & Wong, SS. Improved Approximate String Matching Using Compressed Suffix Data Structures. Algorithmica 51, 298–314 (2008). https://doi.org/10.1007/s00453-007-9104-8

  • DOI: https://doi.org/10.1007/s00453-007-9104-8
