Approximate string matching and local similarity

  • Conference paper
Combinatorial Pattern Matching (CPM 1994)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 807))

Abstract

The best known rigorous method for biological sequence comparison has been the algorithm of Smith and Waterman. Given a (nonmetric) similarity measure and a gap penalty, it computes the highest-scoring local alignment of two sequences in quadratic time. In this paper, we describe how the distance-based, sublinear-expected-time algorithm of Chang and Lawler can be extended to solve the local similarity problem efficiently. We present both a new theoretical result, polynomial-space, constant-fraction-error matching that is provably optimal, and a practical adaptation of it that produces results nearly identical to those of Smith-Waterman, at speedups of 2X (PAM 120, roughly corresponding to 33% identity) to 10X (PAM 90, 50% identity) or better. Further improvements are anticipated. What makes this possible is the addition of a new constraint on unit score (average score per residue), which filters out both very short alignments and very long alignments with unacceptably low average. This program is part of a package called Genome Analyst that is being developed at CSHL.
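
For orientation, the quadratic-time baseline mentioned above can be sketched as follows. This is a minimal illustration of Smith-Waterman local alignment with a linear gap penalty, not the filtering algorithm of this paper. The scoring values, the function names `smith_waterman` and `unit_score`, and the reading of unit score as total score divided by the number of alignment columns are all assumptions made for the example; the abstract does not give the paper's exact definition.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, alignment_length) of the best local alignment.

    Uses a simple match/mismatch score and a linear gap penalty
    (illustrative values, not a PAM matrix). Runs in O(len(a) * len(b)).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    # L[i][j]: number of alignment columns used to reach H[i][j].
    L = [[0] * cols for _ in range(rows)]
    best_score, best_len = 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (0, 0),                                   # start a fresh alignment
                (H[i - 1][j - 1] + sub, L[i - 1][j - 1] + 1),  # substitution
                (H[i - 1][j] + gap, L[i - 1][j] + 1),     # gap in b
                (H[i][j - 1] + gap, L[i][j - 1] + 1),     # gap in a
            ]
            # Highest score wins; on ties prefer the shorter alignment.
            H[i][j], L[i][j] = max(candidates, key=lambda c: (c[0], -c[1]))
            if H[i][j] > best_score:
                best_score, best_len = H[i][j], L[i][j]
    return best_score, best_len


def unit_score(score, length):
    """Average score per alignment column (assumed definition)."""
    return score / length if length else 0.0


score, length = smith_waterman("XXGATTACAYY", "ZZGATTACAWW")
print(score, length, unit_score(score, length))
```

A threshold on `unit_score` together with a minimum total score would reject both very short alignments and long alignments with a low average, which is the role the abstract attributes to the unit-score constraint.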

Supported by Department of Energy grant DE-FG02-91ER61190 and National Institutes of Health grant 1R01 HG0020301A1 to T.G. Marr.

References

  1. S.F. Altschul, Amino Acid Substitution Matrices from an Information Theoretic Perspective, J. Molecular Biology, 219(1991), pp. 555–565.

  2. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, A Basic Local Alignment Search Tool, J. Molecular Biology, 215(1990), pp. 403–410.

  3. P. Argos and M. Vingron, Sensitive Comparison of Protein Amino Acid Sequences, in R.F. Doolittle, ed. Methods in Enzymology Volume 183, Academic Press (1990), pp. 352–365.

  4. P. Argos, M. Vingron, and G. Vogt, Protein sequence comparison: methods and significance, Protein Engineering 4(1991), pp. 375–383.

  5. W.I. Chang, Approximate Pattern Matching and Biological Applications, Ph.D. thesis, U.C. Berkeley, August 1991. Also available as Computer Science Division Reports UCB/CSD 91/653–654.

  6. W.I. Chang and J. Lampe, Theoretical and Empirical Comparisons of Approximate String Matching Algorithms, Proc. Combinatorial Pattern Matching '92, Tucson, AZ, April 29-May 1, 1992, Lecture Notes in Computer Science 644, Springer-Verlag, pp. 172–181.

  7. W.I. Chang and E.L. Lawler, Approximate String Matching in Sublinear Expected Time, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, Oct. 22–24, 1990, pp. 116–124.

  8. W.I. Chang and E.L. Lawler, Sublinear Expected Time Approximate String Matching and Biological Applications, Algorithmica, in press.

  9. V. Chvátal and D. Sankoff, Longest Common Subsequences of Two Random Sequences, Technical Report STAN-CS-75-477, Stanford University, Computer Science Department, 1975.

  10. M.O. Dayhoff, R.M. Schwartz, and B.C. Orcutt, A Model of Evolutionary Change in Proteins, in M.O. Dayhoff, ed., Atlas of Protein Sequence and Structure vol. 5. suppl. 3., Nat. Biomed. Res. Found., Washington, D.C., pp. 345–352, 1979.

  11. R.F. Doolittle, ed. Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods in Enzymology Volume 183, Academic Press (1990).

  12. D.G. George, W.C. Barker, and L.T. Hunt, Mutation Data Matrix and Its Uses, in R.F. Doolittle, ed. Methods in Enzymology Volume 183, Academic Press (1990), pp. 333–351.

  13. W.B. Goad and M.I. Kanehisa, Pattern Recognition in Nucleic Acid Sequences I, A General Method for Finding Local Homologies and Symmetries, Nucl. Acids Res. 10(1982), pp. 247–263.

  14. O. Gotoh, An improved algorithm for matching biological sequences, J. Mol. Biol. 162(1982), pp. 705–708.

  15. X. Huang, A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps, Genomics, 1992.

  16. X. Huang and W. Miller, A time-efficient, linear-space local similarity algorithm. Advances in Applied Mathematics 12(1991), pp. 337–357.

  17. S. Karlin and S.F. Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, Proc. Nat. Acad. Sci., USA, 87(1990), 2264–2268.

  18. D.E. Knuth, J.H. Morris, and V.R. Pratt, Fast Pattern Matching in Strings, SIAM J. Comput. 6:2 (1977), pp. 323–350.

  19. G.M. Landau and U. Vishkin, Fast String Matching with k Differences, J. Comp. Sys. Sci. 37(1988), pp. 63–78.

  20. E.W. Myers (1991a), A Sublinear Algorithm for Approximate Keyword Matching, Technical Report TR90-25, Computer Science Dept., University of Arizona, Tucson, September 1991.

  21. E.W. Myers (1991b), An Overview of Sequence Comparison Algorithms in Molecular Biology, Technical Report TR91-29, Computer Science Dept., University of Arizona, Tucson, December 1991.

  22. E.W. Myers, Algorithmic Advances for Searching Biosequence Databases, to appear in S. Suhai, ed., Computational Methods in Genome Research, Plenum Press (1994).

  23. S.B. Needleman and C.E. Wunsch, A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins, J. Mol. Biol. 48(1970), pp. 443–453.

  24. W.R. Pearson, Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms, Genomics 11(1991), pp. 635–650.

  25. W.R. Pearson and D.J. Lipman, Improved tools for biological sequence comparison, Proc. Natl. Acad. Sci. USA 85(1988), pp. 2444–2448.

  26. D. Sankoff and J.B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley (1983).

  27. P.H. Sellers, The Theory and Computation of Evolutionary Distances: Pattern Recognition, J. Algorithms 1(1980), pp. 359–373.

  28. P.H. Sellers, Pattern Recognition in Genetic Sequences by Mismatch Density, Bull. Math. Biol. 46(1984), pp. 501–514.

  29. T.F. Smith and M.S. Waterman, Identification of Common Molecular Subsequences, J. Mol. Biol. 147(1981), pp. 195–197.

  30. S.S. Sturrock and J.F. Collins (1993), MPsrch version 1.3, Biocomputing Research Unit, University of Edinburgh, UK.

  31. E. Ukkonen, Finding Approximate Patterns in Strings, J. Algorithms 6(1985), pp. 132–137.

  32. M. Vingron and M.S. Waterman, Parametric Sequence Alignments and Penalty Choice: Case Studies, manuscript, 1993.

  33. M.S. Waterman, Sequence Alignments, in M.S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press (1989), pp. 53–92.

  34. M.S. Waterman and M. Eggert, A new algorithm for best subsequence alignments with application to tRNA-rRNA comparison, J. Mol. Biol. 197(1987), pp. 723–728.

  35. S. Wu and U. Manber, Fast Text Searching Allowing Errors, Comm. ACM 35(1992), pp. 83–91.

  36. S. Wu, U. Manber, and E.W. Myers, A Sub-quadratic Algorithm for Approximate Limited Expression Matching, Technical Report TR92-36, Computer Science Dept., University of Arizona, Tucson, December 1992.

  37. A.C. Yao, The Complexity of Pattern Matching for a Random String, SIAM J. Comput. 8(1979), pp. 368–387.

Editor information

Maxime Crochemore, Dan Gusfield

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chang, W.I., Marr, T.G. (1994). Approximate string matching and local similarity. In: Crochemore, M., Gusfield, D. (eds) Combinatorial Pattern Matching. CPM 1994. Lecture Notes in Computer Science, vol 807. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58094-8_23

  • DOI: https://doi.org/10.1007/3-540-58094-8_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-58094-2

  • Online ISBN: 978-3-540-48450-9

  • eBook Packages: Springer Book Archive
