Sublinear approximate string matching and biological applications

Algorithmica

Abstract

Given a text string of length $n$ and a pattern string of length $m$ over a $b$-letter alphabet, the $k$ differences approximate string matching problem asks for all locations in the text where the pattern occurs with at most $k$ differences (substitutions, insertions, deletions). We treat $k$ not as a constant but as a fraction of $m$ (not necessarily constant-fraction). Previous algorithms require at least $O(kn)$ time (or exponential space). We give an algorithm that runs in sublinear time $O((n/m)\,k \log_b m)$ when the text is random and $k$ is bounded by the threshold $m/(\log_b m + O(1))$. In particular, when $k = o(m/\log_b m)$ the expected running time is $o(n)$. In the worst case our algorithm is $O(kn)$, but is still an improvement in that it is practical and uses $O(m)$ space compared with $O(n)$ or $O(m^2)$. We define three problems motivated by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. Respectively, applications to biology are local similarity search, sequence assembly, and DNA-protein matching.
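
To make the problem statement concrete, the following is a minimal sketch of the classical column-by-column dynamic program for the k differences problem (in the style usually attributed to Sellers [41]). It is not the sublinear algorithm announced in the abstract: it always takes time proportional to the product of the text and pattern lengths, though it does use only space linear in the pattern length. The function name `k_differences_matches` and the convention of reporting 0-based end positions are assumptions made for illustration.

```python
def k_differences_matches(pattern: str, text: str, k: int) -> list[int]:
    """Return every 0-based position j such that some substring of `text`
    ending at j matches `pattern` with at most k differences
    (substitutions, insertions, deletions)."""
    m = len(pattern)
    # col[i] = fewest differences between pattern[:i] and the best
    # substring of the text ending at the current text position.
    # Before any text is read, pattern[:i] needs i deletions.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # value for (i-1, j-1)
        col[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]    # value for (i, j-1), saved before overwrite
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                old + 1,                            # extra text character
                col[i - 1] + 1,                     # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(k_differences_matches("abc", "xxabcxx", 0))  # [4]
    print(k_differences_matches("abc", "xxabcxx", 1))  # [3, 4, 5]
```

Only one column of the table is kept at a time, which is where the linear space bound comes from; the price is that every text character is examined, which the abstract's sublinear result avoids on random text.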

References

  1. A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Comm. ACM 18 (1975), 333–340.

  2. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, A Basic Local Alignment Search Tool, J. Molecular Biology 215 (1990), 403–410.

  3. A. Apostolico, The Myriad Virtues of Subword Trees, in A. Apostolico and Z. Galil, eds., Combinatorial Algorithms on Words, NATO ASI Series F, Vol. 12, Springer-Verlag, New York, 1985, pp. 85–96.

  4. W. I. Chang, Fast Implementation of the Schieber-Vishkin Lowest Common Ancestor Algorithm, Computer program, 1990.

  5. W. I. Chang, Approximate Pattern Matching and Biological Applications, Ph.D. thesis, University of California, Berkeley, August 1991. Also available as Computer Science Division Reports UCB/CSD 91/653-654.

  6. W. I. Chang, Approximate String Matching and Local Similarity, Proc. Fifth Annual Symposium on Combinatorial Pattern Matching, Asilomar, CA, June 5–8, 1994, Lecture Notes in Computer Science, Springer-Verlag, Berlin, in press.

  7. W. I. Chang and J. Lampe, Theoretical and Empirical Comparisons of Approximate String Matching Algorithms, Proc. Third Annual Symposium on Combinatorial Pattern Matching, Tucson, AZ, April 29–May 1, 1992, Lecture Notes in Computer Science, Vol. 644, Springer-Verlag, Berlin, 1992, pp. 175–184.

  8. W. I. Chang and E. L. Lawler, Approximate String Matching in Sublinear Expected Time, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, Oct. 22–24, 1990, pp. 116–124.

  9. W. I. Chang and E. L. Lawler, Approximate String Matching and Biological Sequence Analysis (poster), Human Genome II Official Program and Abstracts, San Diego, CA, Oct. 22–24, 1990, p. 24.

  10. B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo, Sequence Landscapes, Nucleic Acids Res. 14(1) (1986), 141–158.

  11. M. Crochemore, Longest Common Factor of Two Words, Proc. TAPSOFT '87, Lecture Notes in Computer Science, Vol. 249, Springer-Verlag, Berlin, 1988, pp. 26–36.

  12. R. F. Doolittle, ed., Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods in Enzymology, Volume 183, Academic Press, New York, 1990.

  13. E. R. Fiala and D. H. Greene, Data Compression with Finite Windows, Comm. ACM 32(4) (1989), 490–505.

  14. Z. Galil and R. Giancarlo, Data Structures and Algorithms for Approximate String Matching, J. Complexity 4 (1988), 33–72.

  15. Z. Galil and K. Park, An Improved Algorithm for Approximate String Matching, SIAM J. Comput. 19(6) (1990), 989–999.

  16. G. H. Gonnet and R. Baeza-Yates, Handbook of Algorithms and Data Structures: in Pascal and C, 2nd edn., Addison-Wesley, Reading, MA, 1991.

  17. D. Gusfield, Efficient Algorithms for String Manipulation and Pattern Matching, Lecture Notes, University of California, Davis, 1989.

  18. D. Gusfield, K. Balasubramanian, and D. Naor, Parametric Optimization of Sequence Alignment, Proc. Third Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 1992, pp. 432–439.

  19. D. Gusfield, G. M. Landau, and B. Schieber, An Efficient Algorithm for the All Pairs Suffix-Prefix Problem, Proc. Sequences 91, Italy, July 1991.

  20. X. Huang, A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps, Genomics 14(1) (1992), 18–25.

  21. L. C. Hui, Color Set Size Problem with Applications to String Matching, Proc. Third Annual Symposium on Combinatorial Pattern Matching, Tucson, AZ, April 29–May 1, 1992, Lecture Notes in Computer Science, Vol. 644, Springer-Verlag, Berlin, pp. 230–243.

  22. P. Jokinen, J. Tarhio, and E. Ukkonen, A Comparison of Approximate String Matching Algorithms, Manuscript, 1990.

  23. S. Kannan and T. Warnow, Inferring Evolutionary History from DNA Sequences, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, October 1990, pp. 362–371.

  24. S. Karlin, F. Ost, and B. E. Blaisdell, Patterns in DNA and Amino Acid Sequences and Their Statistical Significance, in M. S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press, Boca Raton, FL, 1989, pp. 133–157.

  25. R. M. Karp, Probabilistic Analysis of Algorithms, Lecture notes, University of California, Berkeley, Spring 1988; Fall 1989.

  26. R. M. Karp and M. O. Rabin, Efficient Randomized Pattern-Matching Algorithms, IBM J. Res. Develop. 31 (1987), 249–260.

  27. J. D. Kececioglu, Exact and Approximate Algorithms for DNA Sequence Reconstruction, Ph.D. thesis, University of Arizona, Tucson, 1991. Also available as Technical Report TR91-26, Computer Science Department, University of Arizona, Tucson.

  28. D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast Pattern Matching in Strings, SIAM J. Comput. 6(2) (1977), 323–350.

  29. G. M. Landau and U. Vishkin, Fast String Matching with k Differences, J. Comp. System Sci. 37 (1988), 63–78.

  30. G. M. Landau and U. Vishkin, Fast Parallel and Serial Approximate String Matching, J. Algorithms 10 (1989), 157–169.

  31. V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl. 6 (1966), 126–136.

  32. E. M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, J. Assoc. Comput. Mach. 23(2) (1976), 262–272.

  33. E. W. Myers, A Sublinear Algorithm for Approximate Keyword Matching, Technical Report TR90-25, Computer Science Department, University of Arizona, Tucson, September 1991.

  34. National Center for Human Genome Research, Understanding Our Genetic Inheritance (The U.S. Human Genome Project: The First Five Years FY 1991–1995), NIH Publication No. 90-1580, April 1990.

  35. K. Park, Fast String Matching On the Average, Manuscript, 1990.

  36. W. R. Pearson and D. J. Lipman, Improved tools for biological sequence comparison, Proc. Nat. Acad. Sci. USA 85 (1988), 2444–2448.

  37. H. Peltola, H. Söderlund, and E. Ukkonen, SEQAID: A DNA Sequence Assembling Program Based on a Mathematical Model, Nucleic Acids Res. 12(1) (1984), 307–321.

  38. M. Rodeh, V. R. Pratt, and S. Even, Linear Algorithms for Data Compression via String Matching, J. Assoc. Comput. Mach. 28(1) (1981), 16–24.

  39. D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, 1983.

  40. B. Schieber and U. Vishkin, On Finding Lowest Common Ancestors: Simplification and Parallelization, SIAM J. Comput. 17(6) (1988), 1253–1262.

  41. P. H. Sellers, The Theory and Computation of Evolutionary Distances: Pattern Recognition, J. Algorithms 1 (1980), 359–373.

  42. E. Ukkonen, Finding Approximate Patterns in Strings, J. Algorithms 6 (1985), 132–137.

  43. E. Ukkonen, Personal communications.

  44. E. Ukkonen and D. Wood, Approximate String Matching with Suffix Automata, Report A-1990-4, Department of Computer Science, University of Helsinki, April 1990.

  45. M. S. Waterman, Sequence Alignments, in M. S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press, Boca Raton, FL, 1989, pp. 53–92.

  46. M. S. Waterman, M. Eggert, and E. Lander, Parametric Sequence Comparisons, Proc. Nat. Acad. Sci. USA 89 (1992), 6090–6093.

  47. P. Weiner, Linear Pattern Matching Algorithms, Proc. IEEE Symposium on Switching and Automata Theory, 1973, pp. 1–11.

  48. S. Wu, U. Manber, and E. Myers, Improving the Running Times for Some String Matching Problems, Technical Report TR91-20, Computer Science Department, University of Arizona, Tucson, August 1991.

  49. A. C. Yao, The Complexity of Pattern Matching for a Random String, SIAM J. Comput. 8 (1979), 368–387.

Additional information

Communicated by Alberto Apostolico.

This work was supported in part by NSF Grants CCR-87-04184 and FD-89-02813; by the Human Genome Center, Lawrence Berkeley Laboratory, supported by the Director, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098; and by Department of Energy Grants DE-FG03-90ER60999 and DE-FG02-91ER61190. Earlier versions of this paper appeared as [8] and as part of [5].

About this article

Cite this article

Chang, W.I., Lawler, E.L. Sublinear approximate string matching and biological applications. Algorithmica 12, 327–344 (1994). https://doi.org/10.1007/BF01185431

  • DOI: https://doi.org/10.1007/BF01185431
