Abstract
We study in depth a model of non-exact pattern matching based on edit distance, the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols into another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, and the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to their dependence on b, the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes far fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for a binary alphabet, 4X for a four-letter alphabet, and 10X for a twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(√b-1)) on the average, when alphabet size is taken into consideration.
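To make the problem statement concrete, the following is a minimal Python sketch of the classical column-by-column DP for k differences, together with a cut-off variant in the spirit of the algorithm attributed to Ukkonen above. It is not the new algorithm of this paper, whose details the abstract does not give. The function names, the 1-based end positions in the output, and the use of k + 1 as a stand-in for any entry known to exceed k are all assumptions made for illustration.

```python
def k_differences_full(pattern, text, k):
    """Plain O(mn) DP: col[i] is the least number of differences between
    pattern[:i] and some substring of the text ending at position j."""
    m = len(pattern)
    col = list(range(m + 1))          # column for j = 0: col[i] = i
    matches = []
    for j, c in enumerate(text, start=1):
        diag = col[0]                 # entry (i-1, j-1); row 0 is always 0
        for i in range(1, m + 1):
            left = col[i]             # entry (i, j-1)
            col[i] = min(diag + (pattern[i - 1] != c),  # substitution or match
                         left + 1,                      # extra text symbol
                         col[i - 1] + 1)                # extra pattern symbol
            diag = left
        if col[m] <= k:
            matches.append(j)         # a match ends at text position j
    return matches


def k_differences_cutoff(pattern, text, k):
    """Cut-off variant (illustrative): only rows up to last + 1 are computed,
    where `last` is the largest row whose entry in the previous column is <= k.
    Values along a diagonal never decrease, so deeper rows cannot be <= k."""
    m = len(pattern)
    col = list(range(m + 1))
    last = min(k, m)
    matches = []
    for j, c in enumerate(text, start=1):
        top = min(last + 1, m)
        diag = col[0]
        for i in range(1, top + 1):
            # Rows below `last` may hold stale values; all that matters is
            # that they exceed k, so k + 1 is used in their place.
            left = col[i] if i <= last else k + 1
            col[i] = min(diag + (pattern[i - 1] != c), left + 1, col[i - 1] + 1)
            diag = left
        last = top
        while col[last] > k:          # col[0] == 0 stops the loop when k >= 0
            last -= 1
        if last == m:
            matches.append(j)
    return matches


if __name__ == "__main__":
    print(k_differences_full("abc", "xxabcxx", 1))
    print(k_differences_cutoff("abc", "xxabcxx", 1))
```

Both functions report the same end positions. The cut-off version simply skips rows that cannot hold a value of at most k, which is the source of the savings on random text that the abstract analyzes.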
This research was conducted at the University of California, Berkeley, and was supported in part by Department of Energy grant DE-FG03-90ER60999.
References
R. Arratia and M.S. Waterman, Critical Phenomena in Sequence Matching, The Annals of Probability 13:4(1985), pp. 1236–1249.
W.I. Chang, Fast Implementation of the Schieber-Vishkin Lowest Common Ancestor Algorithm, computer program, 1990.
W.I. Chang, Approximate Pattern Matching and Biological Applications, Ph.D. thesis, U.C. Berkeley, August 1991.
W.I. Chang and E.L. Lawler, Approximate String Matching in Sublinear Expected Time, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, October 1990, pp. 116–124.
W.I. Chang and E.L. Lawler, Approximate String Matching and Biological Sequence Analysis (poster), abstract in Human Genome II Official Program and Abstracts, San Diego, CA, Oct. 22–24, 1990, p. 24.
V. Chvátal and D. Sankoff, Longest Common Subsequences of Two Random Sequences, Technical Report STAN-CS-75-477, Stanford University, Computer Science Department, 1975; J. Applied Prob. 12(1975), pp. 306–315.
J. Deken, Some Limit Results for Longest Common Subsequences, Discrete Mathematics 26(1979), pp. 17–31.
Z. Galil and R. Giancarlo, Data Structures and Algorithms for Approximate String Matching, Journal of Complexity 4(1988), pp. 33–72.
Z. Galil and K. Park, An Improved Algorithm for Approximate String Matching, SIAM J. Comput. 19:6(1990), pp. 989–999.
Z. Galil and K. Park, Dynamic Programming with Convexity, Concavity, and Sparsity, manuscript, October 1990.
D. Gusfield, K. Balasubramanian, J. Bronder, D. Mayfield, D. Naor, Paral: A Method and Computer Package for Optimal String Alignment using Variable Weights, in preparation.
D. Gusfield, K. Balasubramanian and D. Naor, Parametric Optimization of Sequence Alignment, submitted.
P.A.V. Hall and G.R. Dowling, Approximate String Matching, Computing Surveys 12:4(1980), pp. 381–402.
D. Harel and R.E. Tarjan, Fast Algorithms for Finding Nearest Common Ancestors, SIAM J. Comput. 13(1984), pp. 338–355.
N.L. Johnson and S. Kotz, Distributions in Statistics: Discrete Distributions, Houghton Mifflin Company (1969).
P. Jokinen, J. Tarhio, and E. Ukkonen, A Comparison of Approximate String Matching Algorithms, manuscript, October 1990.
S. Karlin, F. Ost, and B.E. Blaisdell, Patterns in DNA and Amino Acid Sequences and Their Statistical Significance, in M.S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press (1989), pp. 133–157.
R.M. Karp, Probabilistic Analysis of Algorithms, lecture notes, U.C. Berkeley (Spring 1988; Fall 1989).
G.M. Landau and U. Vishkin, Fast String Matching with k Differences, J. Comp. Sys. Sci. 37(1988), pp. 63–78.
G.M. Landau and U. Vishkin, Fast Parallel and Serial Approximate String Matching, J. Algorithms 10(1989), pp. 157–169.
G.M. Landau, U. Vishkin, and R. Nussinov, Locating alignments with k differences for nucleotide and amino acid sequences, CABIOS 4:1(1988), pp. 19–24.
V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl. 6(1966), pp. 126–136.
E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, J. ACM 23:2 (1976), pp. 262–272.
U. Manber and S. Wu, Approximate String Matching with Arbitrary Costs for Text and Hypertext, manuscript, February 1990.
E.W. Myers, An O(ND) Difference Algorithm and Its Variations, Algorithmica 1(1986), pp. 252–266.
E.W. Myers, Incremental Alignment Algorithms and Their Applications, SIAM J. Comput., accepted for publication.
D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley (1983).
D. Sankoff and S. Mainville, Common Subsequences and Monotone Subsequences, in D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley (1983), pp. 363–365.
B. Schieber and U. Vishkin, On Finding Lowest Common Ancestors: Simplification and Parallelization, SIAM J. Comput. 17:6(1988), pp. 1253–1262.
P.H. Sellers, The Theory and Computation of Evolutionary Distances: Pattern Recognition, J. Algorithms 1(1980), pp. 359–373.
J. Tarhio and E. Ukkonen, Approximate Boyer-Moore String Matching, Report A-1990-3, Dept. of Computer Science, University of Helsinki, March 1990.
E. Ukkonen, Algorithms for Approximate String Matching, Inf. Contr. 64(1985), pp. 100–118.
E. Ukkonen, Finding Approximate Patterns in Strings, J. Algorithms 6(1985), pp. 132–137.
E. Ukkonen, personal communications.
E. Ukkonen and D. Wood, Approximate String Matching with Suffix Automata, Report A-1990-4, Dept. of Computer Science, University of Helsinki, April 1990.
M.S. Waterman, Sequence Alignments, in M.S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press (1989), pp. 53–92.
M.S. Waterman, L. Gordon, and R. Arratia, Phase transitions in sequence matches and nucleic acid structure, Proc. Natl. Acad. Sci. USA 84(1987), pp. 1239–1243.
Copyright information
© 1992 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, W.I., Lampe, J. (1992). Theoretical and empirical comparisons of approximate string matching algorithms. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1992. Lecture Notes in Computer Science, vol 644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56024-6_14
DOI: https://doi.org/10.1007/3-540-56024-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56024-1
Online ISBN: 978-3-540-47357-2
eBook Packages: Springer Book Archive