A fast filtration algorithm for the substring matching problem

Pevzner, Pavel A.; Waterman, Michael S.

doi:10.1007/BFb0029806

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 684)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and by the optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m, the problem coincides with the classical problem of approximate string matching with k mismatches. We present a new approach to this problem based on multiple filtration, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. The approach is a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
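To make the problem statement concrete, the following is a minimal Python sketch of the two-stage idea (preselect candidates, then verify them exactly). It does not reproduce the multiple filtration technique of the paper, which the abstract does not specify. The filter used here is the standard pigeonhole argument, swapped in for illustration only: if two m-tuples differ in at most k positions, at least one of k + 1 disjoint blocks must match exactly. The function names `hamming_within`, `substring_matches_naive`, and `substring_matches_filtered` are hypothetical.

```python
from collections import defaultdict


def hamming_within(a, b, k):
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def substring_matches_naive(text, query, m, k):
    """Brute force: all (i, j) such that text[i:i+m] and query[j:j+m]
    differ in at most k positions."""
    return [
        (i, j)
        for i in range(len(text) - m + 1)
        for j in range(len(query) - m + 1)
        if hamming_within(text[i:i + m], query[j:j + m], k)
    ]


def substring_matches_filtered(text, query, m, k):
    """Same result in two stages: a pigeonhole filter (an assumption of this
    sketch, not the paper's filter) followed by exact verification."""
    block = m // (k + 1)
    if block == 0:
        # Blocks would be empty, so the filter cannot prune anything.
        return substring_matches_naive(text, query, m, k)

    # Index every length-`block` substring of the text by its start position.
    index = defaultdict(list)
    for p in range(len(text) - block + 1):
        index[text[p:p + block]].append(p)

    # Stage 1: a pair (i, j) is a candidate if any of the k + 1 aligned
    # blocks of the two m-tuples matches exactly.
    candidates = set()
    for j in range(len(query) - m + 1):
        for b in range(k + 1):
            off = b * block
            for p in index.get(query[j + off:j + off + block], ()):
                i = p - off
                if 0 <= i <= len(text) - m:
                    candidates.add((i, j))

    # Stage 2: verify each candidate with the exact mismatch count.
    return sorted(
        (i, j) for (i, j) in candidates
        if hamming_within(text[i:i + m], query[j:j + m], k)
    )
```

Both functions return the same sorted list of position pairs; the filtered version only runs the exact comparison on pairs that survive the first stage. With q = m the query contributes a single m-tuple, which corresponds to the classical k-mismatches setting mentioned in the abstract.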

The research was supported in part by the National Science Foundation (DMS 90-05833) and the National Institutes of Health (GM-36230). This paper was written when P.A.P. was at the Department of Mathematics, University of Southern California.

Author information

Authors and Affiliations

Computer Science Department, The Pennsylvania State University, University Park, PA 16802
Pavel A. Pevzner
Departments of Mathematics and of Molecular Biology, University of Southern California, Los Angeles, California 90089-1113
Michael S. Waterman

Editor information

Alberto Apostolico, Maxime Crochemore, Zvi Galil, Udi Manber

Cite this paper

Pevzner, P.A., Waterman, M.S. (1993). A fast filtration algorithm for the substring matching problem. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1993. Lecture Notes in Computer Science, vol 684. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0029806

DOI: https://doi.org/10.1007/BFb0029806
Published: 17 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56764-6
Online ISBN: 978-3-540-47732-7
eBook Packages: Springer Book Archive
