Abstract
We consider the classical approximate string-matching problem: find the locations of approximate occurrences P′ of a pattern string P in a text string T such that the edit distance between P and P′ is at most k. We concentrate on the special case in which T is available for preprocessing before searches with varying P and k. We show how the searches can be done fast by using the suffix tree of T, augmented with suffix links, as the preprocessed form of T and applying dynamic programming over the tree. Three variations of the search algorithm are developed, with running times O(mq + n), O(mq log q + size of the output), and O(m^2 q + size of the output). Here n = |T|, m = |P|, and q varies between 0 and n depending on the problem instance. For the unit-cost edit distance it is shown that q = O(min(n, m^(k+1) |Σ|^k)), where Σ is the alphabet.
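For orientation, the problem statement above can be illustrated with the classical column-by-column dynamic programming over the text, which takes O(mn) time and uses no preprocessing of T. This is only a baseline sketch of the problem being solved, not the suffix-tree algorithm of the paper; the function name `approx_end_positions` and the choice to report 1-based end positions are assumptions made for this example.

```python
def approx_end_positions(pattern, text, k):
    """Return the 1-based positions j in text at which some substring
    ending at j is within unit-cost edit distance k of pattern.

    Baseline O(m*n) dynamic programming; illustrative only, it does
    not use a suffix tree or any preprocessing of the text.
    """
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any
    # substring of the text ending at the current text position.
    # Before any text is read, pattern[:i] can only match the empty string.
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value of row i-1 in the previous column
        col[0] = 0          # an occurrence may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_val = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra text character
                col[i - 1] + 1,    # missing pattern character
            )
            prev_diag = col[i]
            col[i] = new_val
        if col[m] <= k:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(approx_end_positions("abc", "xxabcxx", 1))
```

With k = 0 the function reduces to exact matching. The approach in the abstract replaces the scan over all n text positions with a traversal of the suffix tree of T, so that the work depends on q rather than on n alone.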
This work was supported by the Academy of Finland and by the Alexander von Humboldt Foundation (Germany).
Copyright information
© 1993 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ukkonen, E. (1993). Approximate string-matching over suffix trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1993. Lecture Notes in Computer Science, vol 684. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0029808
DOI: https://doi.org/10.1007/BFb0029808
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56764-6
Online ISBN: 978-3-540-47732-7
eBook Packages: Springer Book Archive