Theoretical and empirical comparisons of approximate string matching algorithms

  • Conference paper
Combinatorial Pattern Matching (CPM 1992)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 644)

Abstract

We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols into another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, and the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to their dependence on b, the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes far fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for a binary alphabet, 4X for a four-letter alphabet, and 10X for a twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier “cut-off” algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(√b-1)) on the average, when alphabet size is taken into consideration.
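To make the problem statement concrete, the following is a minimal Python sketch of the k differences problem as the abstract defines it. It is not the new algorithm announced in the paper, whose details are not given here. The first function fills the whole DP table column by column using the standard edit-distance recurrence with a free starting position in the text. The second adds a cut-off in the spirit of the one the abstract attributes to Ukkonen: each column is evaluated only one row past the last entry that is at most k. The function names, the 1-based end-position convention, and the cell counter are assumptions made for illustration only.

```python
def k_differences(pattern, text, k):
    """Full DP table, one column per text character.

    D[i][j] is the least number of differences between pattern[:i] and
    some substring of the text ending at position j (1-based).
    Returns every j with D[m][j] <= k.
    """
    m = len(pattern)
    prev = list(range(m + 1))          # column 0: D[i][0] = i
    ends = []
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)            # D[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,   # substitution or exact match
                         prev[i] + 1,          # extra text character
                         cur[i - 1] + 1)       # extra pattern character
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


def k_differences_cutoff(pattern, text, k):
    """Same answers, but each column stops one row past the last entry <= k.

    Entries known to exceed k are stored as k + 1, which is enough to
    decide every entry that is at most k. Also returns the number of
    table cells actually evaluated. Assumes k >= 0.
    """
    m = len(pattern)
    big = k + 1                        # stands in for "more than k"
    col = [min(i, big) for i in range(m + 1)]
    last = min(k, m)                   # largest row with col[row] <= k
    ends = []
    cells = 0
    for j, ch in enumerate(text, start=1):
        top = min(last + 1, m)         # deepest row that can still be <= k
        diag = col[0]                  # D[i-1][j-1], starting with D[0][j-1] = 0
        for i in range(1, top + 1):
            old = col[i]               # D[i][j-1]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(diag + cost, old + 1, col[i - 1] + 1, big)
            diag = old
            cells += 1
        last = top
        while col[last] > k:           # col[0] == 0, so this terminates
            last -= 1
        if last == m:
            ends.append(j)
    return ends, cells
```

Both functions report the same end positions. The second usually touches far fewer than the m·n cells of the full table when k is small relative to m, which is the quantity the empirical comparisons in the abstract are concerned with.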

This research was conducted at the University of California, Berkeley, and was supported in part by Department of Energy grant DE-FG03-90ER60999.

Editor information

Alberto Apostolico, Maxime Crochemore, Zvi Galil, Udi Manber

Copyright information

© 1992 Springer-Verlag Berlin Heidelberg

Cite this paper

Chang, W.I., Lampe, J. (1992). Theoretical and empirical comparisons of approximate string matching algorithms. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1992. Lecture Notes in Computer Science, vol 644. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56024-6_14

  • DOI: https://doi.org/10.1007/3-540-56024-6_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-56024-1

  • Online ISBN: 978-3-540-47357-2

  • eBook Packages: Springer Book Archive
