Abstract
Approximate string matching is the problem of finding occurrences of a pattern in a text while allowing a bounded number of errors. In this paper we present a space-efficient data structure for the 1-mismatch and 1-difference problems. Given a text T of length n over an alphabet A, we can preprocess T into a data structure of \(O(n\sqrt{\log n}\log |A|)\) bits so that, for any query pattern P of length m, all 1-mismatch (or 1-difference) occurrences of P can be found in \(O(|A| m \log\log n + occ)\) time, where occ is the number of occurrences. This is the fastest known query time among data structures that use \(o(n\log^2 n)\) bits of space.
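To make the two problem statements concrete, the following is a minimal brute-force sketch in Python. It is not the indexing data structure described in the abstract: it scans the text directly and is meant only as a reference definition of the answers such an index should return. The function names are illustrative assumptions. For the difference variant, the sketch uses the classical dynamic-programming formulation of approximate matching under edit distance and reports the end positions of approximate occurrences.

```python
def k_mismatch_occurrences(text, pattern, k=1):
    """Start positions where pattern matches text with at most k substitutions."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for i in range(m):
            if text[start + i] != pattern[i]:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            hits.append(start)
    return hits


def k_difference_end_positions(text, pattern, k=1):
    """End positions j such that some substring of text ending at j is within
    edit distance k (insertions, deletions, substitutions) of pattern."""
    m = len(pattern)
    prev = list(range(m + 1))      # column for the empty text prefix
    ends = []
    for j, tc in enumerate(text):
        cur = [0] * (m + 1)        # cur[0] = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra text character
                         cur[i - 1] + 1)      # missing pattern character
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends
```

The mismatch scan takes O(nm) time per query and the dynamic program takes O(nm) time as well, which is the cost an index is designed to avoid for large n.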
The space of our data structure can be further reduced to \(O(n\log |A|)\) bits, with the query time increasing by a factor of \(\log^{\epsilon} n\), for \(0<\epsilon\le 1\). Furthermore, our solution can be generalized to solve the k-mismatch (and the k-difference) problem in \(O(|A|^k m^k (k+\log\log n)+occ)\) time using an \(O(n\sqrt{\log n}\log |A|)\)-bit indexing data structure, and in \(O(\log^{\epsilon} n\,(|A|^k m^k (k+\log\log n)+occ))\) time using an \(O(n\log |A|)\)-bit indexing data structure. For the \(O(n\sqrt{\log n}\log |A|)\)-bit data structure, we assume that the alphabet size |A| is bounded by \(O(2^{\sqrt{\log n}})\).
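One way to build intuition for the \(|A|^k m^k\) factor is to count the patterns that differ from P in at most k positions: there are at most \(m^k\) ways to pick the positions and at most \(|A|^k\) ways to fill them. The sketch below enumerates that set in Python. It is an illustration under that assumption only; the abstract does not spell out the query procedure, and the function name and the requirement that every character of the pattern belongs to the alphabet are assumptions made here.

```python
from itertools import combinations, product


def hamming_neighbourhood(pattern, alphabet, k):
    """Yield each string within Hamming distance k of pattern exactly once.

    Assumes every character of pattern is in alphabet.
    """
    yield pattern
    for d in range(1, k + 1):
        # Choose exactly d positions to change ...
        for positions in combinations(range(len(pattern)), d):
            # ... and at each one, a character different from the original.
            choices = [[c for c in alphabet if c != pattern[p]]
                       for p in positions]
            for replacement in product(*choices):
                chars = list(pattern)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                yield "".join(chars)
```

For a pattern of length m, the number of strings produced is the sum over d from 0 to k of \(\binom{m}{d}(|A|-1)^d\), which is \(O(|A|^k m^k)\) for constant k.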
Cite this article
Lam, TW., Sung, WK. & Wong, SS. Improved Approximate String Matching Using Compressed Suffix Data Structures. Algorithmica 51, 298–314 (2008). https://doi.org/10.1007/s00453-007-9104-8
DOI: https://doi.org/10.1007/s00453-007-9104-8