Abstract
Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and by optimal oligonucleotide probe selection, both routinely used in molecular biology. In the case q = m, the problem coincides with the classical problem of approximate string matching with k mismatches. We present a new approach to this problem based on multiple filtration, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. The algorithm is a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
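As a rough illustration of the problem statement and of the general filter-then-verify idea, the sketch below may help. It does not reproduce the multiple filtration technique of this paper, whose details are not given in the abstract. Instead it assumes the classical pigeonhole filter: if two m-tuples differ in at most k positions, then at least one of their first k + 1 aligned blocks of length ⌊m/(k+1)⌋ matches exactly. The function names `hamming`, `brute_force`, and `filter_then_verify` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def brute_force(text, query, m, k):
    """Reference answer: all (i, j) such that text[i:i+m] and
    query[j:j+m] differ by at most k mismatches."""
    return {
        (i, j)
        for i in range(len(text) - m + 1)
        for j in range(len(query) - m + 1)
        if hamming(text[i:i + m], query[j:j + m]) <= k
    }


def filter_then_verify(text, query, m, k):
    """Two-stage search using a pigeonhole filter (an assumption of this
    sketch, not the paper's multiple filtration)."""
    block = m // (k + 1)  # length of each exact-match block
    if block == 0:
        # Blocks would be empty, so the filter cannot prune anything.
        return brute_force(text, query, m, k)

    # Index every block-length substring of the query by its start position.
    index = defaultdict(list)
    for p in range(len(query) - block + 1):
        index[query[p:p + block]].append(p)

    # Stage 1 (filtration): an exact block match at text position s and
    # query position p may be block number b of an m-tuple pair that
    # starts at (s - b*block, p - b*block).
    candidates = set()
    for s in range(len(text) - block + 1):
        for p in index.get(text[s:s + block], ()):
            for b in range(k + 1):
                i, j = s - b * block, p - b * block
                if i >= 0 and j >= 0 and i + m <= len(text) and j + m <= len(query):
                    candidates.add((i, j))

    # Stage 2 (verification): count mismatches exactly for each candidate.
    return {
        (i, j)
        for (i, j) in candidates
        if hamming(text[i:i + m], query[j:j + m]) <= k
    }
```

In this sketch the first stage only narrows the set of pairs to check, and the second stage decides membership exactly, so the result should agree with `brute_force` while typically examining far fewer pairs on long inputs.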
The research was supported in part by the National Science Foundation (DMS 90-05833) and the National Institutes of Health (GM-36230). This paper was written when P.A.P. was at the Department of Mathematics, University of Southern California.
© 1993 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pevzner, P.A., Waterman, M.S. (1993). A fast filtration algorithm for the substring matching problem. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1993. Lecture Notes in Computer Science, vol 684. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0029806
DOI: https://doi.org/10.1007/BFb0029806
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56764-6
Online ISBN: 978-3-540-47732-7
eBook Packages: Springer Book Archive