Abstract
Given a text string of length n and a pattern string of length m over a b-letter alphabet, the k differences approximate string matching problem asks for all locations in the text where the pattern occurs with at most k differences (substitutions, insertions, deletions). We treat k not as a constant but as a fraction of m (not necessarily a constant fraction). Previous algorithms require at least O(kn) time (or exponential space). We give an algorithm that runs in sublinear expected time O((n/m) k log_b m) when the text is random and k is bounded by the threshold m/(log_b m + O(1)). In particular, when k = o(m/log_b m) the expected running time is o(n). In the worst case our algorithm takes O(kn) time, but it is still an improvement in that it is practical and uses O(m) space compared with O(n) or O(m^2). We define three problems motivated by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. The corresponding applications to biology are, respectively, local similarity search, sequence assembly, and DNA-protein matching.
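To make the problem statement concrete, the following is a minimal sketch of the classical column-by-column dynamic programming approach to the k differences problem (in the style of Sellers' algorithm, which appears in the reference list). It is a baseline for illustration only, not the sublinear algorithm of this paper: it takes O(mn) time and O(m) space. The function name and the convention of reporting 1-based end positions are assumptions made for this example.

```python
def k_differences_end_positions(text, pattern, k):
    """Return the 1-based positions j such that some substring of `text`
    ending at j matches `pattern` with at most k differences
    (substitutions, insertions, deletions).

    Illustrative baseline only: O(m*n) time, O(m) space.
    """
    m = len(pattern)
    # col[i] = fewest differences between pattern[:i] and any substring
    # of the text ending at the current text position.
    # Before reading any text, pattern[:i] costs i deletions.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]   # value for (i-1, j-1); row 0 is always 0
        col[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            prev_row = col[i]  # value for (i, j-1)
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         prev_row + 1,      # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = prev_row
        if col[m] <= k:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(k_differences_end_positions("abcdefg", "cde", 0))
    print(k_differences_end_positions("abcdefg", "cde", 1))
```

Keeping a single column of m + 1 values is what gives the O(m) space bound; the time is still proportional to mn, which is the cost that the O(kn) and sublinear expected time algorithms discussed in the abstract improve on.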
References
[1] A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Comm. ACM 18 (1975), 333–340.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, A Basic Local Alignment Search Tool, J. Molecular Biology 215 (1990), 403–410.
[3] A. Apostolico, The Myriad Virtues of Subword Trees, in A. Apostolico and Z. Galil, eds., Combinatorial Algorithms on Words, NATO ASI Series F, Vol. 12, Springer-Verlag, New York, 1985, pp. 85–96.
[4] W. I. Chang, Fast Implementation of the Schieber-Vishkin Lowest Common Ancestor Algorithm, Computer program, 1990.
[5] W. I. Chang, Approximate Pattern Matching and Biological Applications, Ph.D. thesis, University of California, Berkeley, August 1991. Also available as Computer Science Division Reports UCB/CSD 91/653-654.
[6] W. I. Chang, Approximate String Matching and Local Similarity, Proc. Fifth Annual Symposium on Combinatorial Pattern Matching, Asilomar, CA, June 5–8, 1994, Lecture Notes in Computer Science, Springer-Verlag, Berlin, in press.
[7] W. I. Chang and J. Lampe, Theoretical and Empirical Comparisons of Approximate String Matching Algorithms, Proc. Third Annual Symposium on Combinatorial Pattern Matching, Tucson, AZ, April 29–May 1, 1992, Lecture Notes in Computer Science, Vol. 644, Springer-Verlag, Berlin, 1992, pp. 175–184.
[8] W. I. Chang and E. L. Lawler, Approximate String Matching in Sublinear Expected Time, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, Oct. 22–24, 1990, pp. 116–124.
[9] W. I. Chang and E. L. Lawler, Approximate String Matching and Biological Sequence Analysis (poster), Human Genome II Official Program and Abstracts, San Diego, CA, Oct. 22–24, 1990, p. 24.
[10] B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo, Sequence Landscapes, Nucleic Acids Res. 14(1) (1986), 141–158.
[11] M. Crochemore, Longest Common Factor of Two Words, Proc. TAPSOFT '87, Lecture Notes in Computer Science, Vol. 249, Springer-Verlag, Berlin, 1988, pp. 26–36.
[12] R. F. Doolittle, ed., Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods in Enzymology, Volume 183, Academic Press, New York, 1990.
[13] E. R. Fiala and D. H. Greene, Data Compression with Finite Windows, Comm. ACM 32(4) (1989), 490–505.
[14] Z. Galil and R. Giancarlo, Data Structures and Algorithms for Approximate String Matching, J. Complexity 4 (1988), 33–72.
[15] Z. Galil and K. Park, An Improved Algorithm for Approximate String Matching, SIAM J. Comput. 19(6) (1990), 989–999.
[16] G. H. Gonnet and R. Baeza-Yates, Handbook of Algorithms and Data Structures: in Pascal and C, 2nd edn., Addison-Wesley, Reading, MA, 1991.
[17] D. Gusfield, Efficient Algorithms for String Manipulation and Pattern Matching, Lecture Notes, University of California, Davis, 1989.
[18] D. Gusfield, K. Balasubramanian, and D. Naor, Parametric Optimization of Sequence Alignment, Proc. Third Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 1992, pp. 432–439.
[19] D. Gusfield, G. M. Landau, and B. Schieber, An Efficient Algorithm for the All Pairs Suffix-Prefix Problem, Proc. Sequences 91, Italy, July 1991.
[20] X. Huang, A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps, Genomics 14(1) (1992), 18–25.
[21] L. C. Hui, Color Set Size Problem with Applications to String Matching, Proc. Third Annual Symposium on Combinatorial Pattern Matching, Tucson, AZ, April 29–May 1, 1992, Lecture Notes in Computer Science, Vol. 644, Springer-Verlag, Berlin, 1992, pp. 230–243.
[22] P. Jokinen, J. Tarhio, and E. Ukkonen, A Comparison of Approximate String Matching Algorithms, Manuscript, 1990.
[23] S. Kannan and T. Warnow, Inferring Evolutionary History from DNA Sequences, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, October 1990, pp. 362–371.
[24] S. Karlin, F. Ost, and B. E. Blaisdell, Patterns in DNA and Amino Acid Sequences and Their Statistical Significance, in M. S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press, Boca Raton, FL, 1989, pp. 133–157.
[25] R. M. Karp, Probabilistic Analysis of Algorithms, Lecture notes, University of California, Berkeley, Spring 1988; Fall 1989.
[26] R. M. Karp and M. O. Rabin, Efficient Randomized Pattern-Matching Algorithms, IBM J. Res. Develop. 31 (1987), 249–260.
[27] J. D. Kececioglu, Exact and Approximate Algorithms for DNA Sequence Reconstruction, Ph.D. thesis, University of Arizona, Tucson, 1991. Also available as Technical Report TR91-26, Computer Science Department, University of Arizona, Tucson.
[28] D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast Pattern Matching in Strings, SIAM J. Comput. 6(2) (1977), 323–350.
[29] G. M. Landau and U. Vishkin, Fast String Matching with k Differences, J. Comp. System Sci. 37 (1988), 63–78.
[30] G. M. Landau and U. Vishkin, Fast Parallel and Serial Approximate String Matching, J. Algorithms 10 (1989), 157–169.
[31] V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl. 6 (1966), 126–136.
[32] E. M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, J. Assoc. Comput. Mach. 23(2) (1976), 262–272.
[33] E. W. Myers, A Sublinear Algorithm for Approximate Keyword Matching, Technical Report TR90-25, Computer Science Department, University of Arizona, Tucson, September 1991.
[34] National Center for Human Genome Research, Understanding Our Genetic Inheritance (The U.S. Human Genome Project: The First Five Years FY 1991–1995), NIH Publication No. 90-1580, April 1990.
[35] K. Park, Fast String Matching On the Average, Manuscript, 1990.
[36] W. R. Pearson and D. J. Lipman, Improved Tools for Biological Sequence Comparison, Proc. Nat. Acad. Sci. USA 85 (1988), 2444–2448.
[37] H. Peltola, H. Söderlund, and E. Ukkonen, SEQAID: A DNA Sequence Assembling Program Based on a Mathematical Model, Nucleic Acids Res. 12(1) (1984), 307–321.
[38] M. Rodeh, V. R. Pratt, and S. Even, Linear Algorithms for Data Compression via String Matching, J. Assoc. Comput. Mach. 28(1) (1981), 16–24.
[39] D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, 1983.
[40] B. Schieber and U. Vishkin, On Finding Lowest Common Ancestors: Simplification and Parallelization, SIAM J. Comput. 17(6) (1988), 1253–1262.
[41] P. H. Sellers, The Theory and Computation of Evolutionary Distances: Pattern Recognition, J. Algorithms 1 (1980), 359–373.
[42] E. Ukkonen, Finding Approximate Patterns in Strings, J. Algorithms 6 (1985), 132–137.
[43] E. Ukkonen, Personal communications.
[44] E. Ukkonen and D. Wood, Approximate String Matching with Suffix Automata, Report A-1990-4, Department of Computer Science, University of Helsinki, April 1990.
[45] M. S. Waterman, Sequence Alignments, in M. S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press, Boca Raton, FL, 1989, pp. 53–92.
[46] M. S. Waterman, M. Eggert, and E. Lander, Parametric Sequence Comparisons, Proc. Nat. Acad. Sci. USA 89 (1992), 6090–6093.
[47] P. Weiner, Linear Pattern Matching Algorithms, Proc. IEEE Symposium on Switching and Automata Theory, 1973, pp. 1–11.
[48] S. Wu, U. Manber, and E. Myers, Improving the Running Times for Some String Matching Problems, Technical Report TR91-20, Computer Science Department, University of Arizona, Tucson, August 1991.
[49] A. C. Yao, The Complexity of Pattern Matching for a Random String, SIAM J. Comput. 8 (1979), 368–387.
Additional information
Communicated by Alberto Apostolico.
This work was supported in part by NSF Grants CCR-87-04184 and FD-89-02813; by the Human Genome Center, Lawrence Berkeley Laboratory, supported by the Director, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098; and by Department of Energy Grants DE-FG03-90ER60999 and DE-FG02-91ER61190. Earlier versions of this paper appeared as [8] and part of [5].
Cite this article
Chang, W.I., Lawler, E.L. Sublinear approximate string matching and biological applications. Algorithmica 12, 327–344 (1994). https://doi.org/10.1007/BF01185431