Sublinear approximate string matching and biological applications

Chang, W. I.; Lawler, E. L.

doi:10.1007/BF01185431

Published: November 1994

Volume 12, pages 327–344, (1994)

Algorithmica

Abstract

Given a text string of length n and a pattern string of length m over a b-letter alphabet, the k differences approximate string matching problem asks for all locations in the text where the pattern occurs with at most k differences (substitutions, insertions, deletions). We treat k not as a constant but as a fraction of m (not necessarily constant-fraction). Previous algorithms require at least O(kn) time (or exponential space). We give an algorithm that is sublinear time O((n/m) k log_b m) when the text is random and k is bounded by the threshold m/(log_b m + O(1)). In particular, when k = o(m/log_b m) the expected running time is o(n). In the worst case our algorithm is O(kn), but is still an improvement in that it is practical and uses O(m) space compared with O(n) or O(m²). We define three problems motivated by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. Respectively, applications to biology are local similarity search, sequence assembly, and DNA-protein matching.
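
To make the problem statement concrete, the following is a minimal sketch of the classical column-by-column dynamic program for the k differences problem. It is not the sublinear algorithm of the paper; it is the simple O(mn)-time, O(m)-space baseline, shown only to illustrate what "occurs with at most k differences" means. The function name and the choice to report 1-based end positions are assumptions made for this illustration.

```python
def k_differences_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Report 1-based positions j such that some substring of `text`
    ending at j matches `pattern` with at most k differences
    (substitutions, insertions, deletions).

    Illustrative baseline only: O(m*n) time, O(m) space.
    """
    m = len(pattern)
    # col[i] = fewest differences between pattern[:i] and any substring
    # of the text ending at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value for row i-1 in the previous column
        col[0] = 0          # an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_val = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra text character
                col[i - 1] + 1,    # missing pattern character
            )
            prev_diag = col[i]
            col[i] = new_val
        if col[m] <= k:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(k_differences_end_positions("abc", "xxabcxx", 0))  # [5]
    print(k_differences_end_positions("abc", "xxabcxx", 1))  # [4, 5, 6]
```

With k = 0 only the exact occurrence ending at position 5 is reported; with k = 1 the neighbouring end positions also qualify, because "ab" and "abcx" are each one difference away from the pattern.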

References

[1] A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Comm. ACM 18 (1975), 333–340.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, A Basic Local Alignment Search Tool, J. Molecular Biology 215 (1990), 403–410.
[3] A. Apostolico, The Myriad Virtues of Subword Trees, in A. Apostolico and Z. Galil, eds., Combinatorial Algorithms on Words, NATO ASI Series F, Vol. 12, Springer-Verlag, New York, 1985, pp. 85–96.
[4] W. I. Chang, Fast Implementation of the Schieber-Vishkin Lowest Common Ancestor Algorithm, Computer program, 1990.
[5] W. I. Chang, Approximate Pattern Matching and Biological Applications, Ph.D. thesis, University of California, Berkeley, August 1991. Also available as Computer Science Division Reports UCB/CSD 91/653-654.
[6] W. I. Chang, Approximate String Matching and Local Similarity, Proc. Fifth Annual Symposium on Combinatorial Pattern Matching, Asilomar, CA, June 5–8, 1994, Lecture Notes in Computer Science, Springer-Verlag, Berlin, in press.
[7] W. I. Chang and J. Lampe, Theoretical and Empirical Comparisons of Approximate String Matching Algorithms, Proc. Third Annual Symposium on Combinatorial Pattern Matching, Tucson, AZ, April 29–May 1, 1992, Lecture Notes in Computer Science, Vol. 644, Springer-Verlag, Berlin, 1992, pp. 175–184.
[8] W. I. Chang and E. L. Lawler, Approximate String Matching in Sublinear Expected Time, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, Oct. 22–24, 1990, pp. 116–124.
[9] W. I. Chang and E. L. Lawler, Approximate String Matching and Biological Sequence Analysis (poster), Human Genome II Official Program and Abstracts, San Diego, CA, Oct. 22–24, 1990, p. 24.
[10] B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo, Sequence Landscapes, Nucleic Acids Res. 14(1) (1986), 141–158.
[11] M. Crochemore, Longest Common Factor of Two Words, Proc. TAPSOFT '87, Lecture Notes in Computer Science, Vol. 249, Springer-Verlag, Berlin, 1988, pp. 26–36.
[12] R. F. Doolittle, ed., Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods in Enzymology, Volume 183, Academic Press, New York, 1990.
[13] E. R. Fiala and D. H. Greene, Data Compression with Finite Windows, Comm. ACM 32(4) (1989), 490–505.
[14] Z. Galil and R. Giancarlo, Data Structures and Algorithms for Approximate String Matching, J. Complexity 4 (1988), 33–72.
[15] Z. Galil and K. Park, An Improved Algorithm for Approximate String Matching, SIAM J. Comput. 19(6) (1990), 989–999.
[16] G. H. Gonnet and R. Baeza-Yates, Handbook of Algorithms and Data Structures: in Pascal and C, 2nd edn., Addison-Wesley, Reading, MA, 1991.
[17] D. Gusfield, Efficient Algorithms for String Manipulation and Pattern Matching, Lecture Notes, University of California, Davis, 1989.
[18] D. Gusfield, K. Balasubramanian, and D. Naor, Parametric Optimization of Sequence Alignment, Proc. Third Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 1992, pp. 432–439.
[19] D. Gusfield, G. M. Landau, and B. Schieber, An Efficient Algorithm for the All Pairs Suffix-Prefix Problem, Proc. Sequences 91, Italy, July 1991.
[20] X. Huang, A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps, Genomics 14(1) (1992), 18–25.
[21] L. C. Hui, Color Set Size Problem with Applications to String Matching, Proc. Third Annual Symposium on Combinatorial Pattern Matching, Tucson, AZ, April 29–May 1, 1992, Lecture Notes in Computer Science, Vol. 644, Springer-Verlag, Berlin, pp. 230–243.
[22] P. Jokinen, J. Tarhio, and E. Ukkonen, A Comparison of Approximate String Matching Algorithms, Manuscript, 1990.
[23] S. Kannan and T. Warnow, Inferring Evolutionary History from DNA Sequences, Proc. 31st Annual IEEE Symposium on Foundations of Computer Science, St. Louis, MO, October 1990, pp. 362–371.
[24] S. Karlin, F. Ost, and B. E. Blaisdell, Patterns in DNA and Amino Acid Sequences and Their Statistical Significance, in M. S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press, Boca Raton, FL, 1989, pp. 133–157.
[25] R. M. Karp, Probabilistic Analysis of Algorithms, Lecture notes, University of California, Berkeley, Spring 1988; Fall 1989.
[26] R. M. Karp and M. O. Rabin, Efficient Randomized Pattern-Matching Algorithms, IBM J. Res. Develop. 31 (1987), 249–260.
[27] J. D. Kececioglu, Exact and Approximate Algorithms for DNA Sequence Reconstruction, Ph.D. thesis, University of Arizona, Tucson, 1991. Also available as Technical Report TR91-26, Computer Science Department, University of Arizona, Tucson.
[28] D. E. Knuth, J. H. Morris, and V. R. Pratt, Fast Pattern Matching in Strings, SIAM J. Comput. 6(2) (1977), 323–350.
[29] G. M. Landau and U. Vishkin, Fast String Matching with k Differences, J. Comp. System Sci. 37 (1988), 63–78.
[30] G. M. Landau and U. Vishkin, Fast Parallel and Serial Approximate String Matching, J. Algorithms 10 (1989), 157–169.
[31] V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl. 6 (1966), 126–136.
[32] E. M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, J. Assoc. Comput. Mach. 23(2) (1976), 262–272.
[33] E. W. Myers, A Sublinear Algorithm for Approximate Keyword Matching, Technical Report TR90-25, Computer Science Department, University of Arizona, Tucson, September 1991.
[34] National Center for Human Genome Research, Understanding Our Genetic Inheritance (The U.S. Human Genome Project: The First Five Years FY 1991–1995), NIH Publication No. 90-1580, April 1990.
[35] K. Park, Fast String Matching On the Average, Manuscript, 1990.
[36] W. R. Pearson and D. J. Lipman, Improved tools for biological sequence comparison, Proc. Nat. Acad. Sci. USA 85 (1988), 2444–2448.
[37] H. Peltola, H. Söderlund, and E. Ukkonen, SEQAID: A DNA Sequence Assembling Program Based on a Mathematical Model, Nucleic Acids Res. 12(1) (1984), 307–321.
[38] M. Rodeh, V. R. Pratt, and S. Even, Linear Algorithms for Data Compression via String Matching, J. Assoc. Comput. Mach. 28(1) (1981), 16–24.
[39] D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, 1983.
[40] B. Schieber and U. Vishkin, On Finding Lowest Common Ancestors: Simplification and Parallelization, SIAM J. Comput. 17(6) (1988), 1253–1262.
[41] P. H. Sellers, The Theory and Computation of Evolutionary Distances: Pattern Recognition, J. Algorithms 1 (1980), 359–373.
[42] E. Ukkonen, Finding Approximate Patterns in Strings, J. Algorithms 6 (1985), 132–137.
[43] E. Ukkonen, Personal communications.
[44] E. Ukkonen and D. Wood, Approximate String Matching with Suffix Automata, Report A-1990-4, Department of Computer Science, University of Helsinki, April 1990.
[45] M. S. Waterman, Sequence Alignments, in M. S. Waterman, ed., Mathematical Methods for DNA Sequences, CRC Press, Boca Raton, FL, 1989, pp. 53–92.
[46] M. S. Waterman, M. Eggert, and E. Lander, Parametric Sequence Comparisons, Proc. Nat. Acad. Sci. USA 89 (1992), 6090–6093.
[47] P. Weiner, Linear Pattern Matching Algorithms, Proc. IEEE Symposium on Switching and Automata Theory, 1973, pp. 1–11.
[48] S. Wu, U. Manber, and E. Myers, Improving the Running Times for Some String Matching Problems, Technical Report TR91-20, Computer Science Department, University of Arizona, Tucson, August 1991.
[49] A. C. Yao, The Complexity of Pattern Matching for a Random String, SIAM J. Comput. 8 (1979), 368–387.

Author information

Authors and Affiliations

Cold Spring Harbor Laboratory, P.O. Box 100, 11724, Cold Spring Harbor, NY, USA
W. I. Chang
Computer Science Division, University of California, 94720, Berkeley, CA, USA
E. L. Lawler

Additional information

Communicated by Alberto Apostolico.

This work was supported in part by NSF Grants CCR-87-04184 and FD-89-02813; by the Human Genome Center, Lawrence Berkeley Laboratory, supported by the Director, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098; and by Department of Energy Grants DE-FG03-90ER60999 and DE-FG02-91ER61190. Earlier versions of this paper appeared as [8] and part of [5].

About this article

Cite this article

Chang, W.I., Lawler, E.L. Sublinear approximate string matching and biological applications. Algorithmica 12, 327–344 (1994). https://doi.org/10.1007/BF01185431

Received: 05 September 1991
Revised: 02 September 1992
Issue Date: November 1994
DOI: https://doi.org/10.1007/BF01185431
