A fast filtration algorithm for the substring matching problem

  • Conference paper
Combinatorial Pattern Matching (CPM 1993)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 684))

Abstract

Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and by optimal oligonucleotide probe selection, both routinely used in molecular biology. In the case q = m, the problem coincides with the classical problem of approximate string matching with k mismatches. We present a new approach to this problem based on multiple filtration, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. The approach is a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
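
To make the problem statement and the filter-then-verify idea concrete, here is a minimal Python sketch. It is not the multiple filtration technique of the paper, whose details the abstract does not give. As a stand-in, the first stage uses the standard pigeonhole filter: if two m-tuples differ in at most k positions, then at least one of k + 1 disjoint blocks of length floor(m / (k + 1)) must agree exactly at the same offset. The function names (`hamming`, `brute_force`, `filter_and_verify`) and the sample strings are assumptions made for illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def brute_force(text, query, m, k):
    """Reference solution: compare every m-tuple of text with every m-tuple of query."""
    return sorted(
        (i, j)
        for i in range(len(text) - m + 1)
        for j in range(len(query) - m + 1)
        if hamming(text[i:i + m], query[j:j + m]) <= k
    )


def filter_and_verify(text, query, m, k):
    """Two-stage search using a pigeonhole filter (an assumed stand-in,
    not the paper's multiple filtration)."""
    block = m // (k + 1)
    if block == 0:
        # Blocks would be empty, so the filter cannot prune anything.
        return brute_force(text, query, m, k)

    # Index every length-`block` substring of the query by its content.
    index = defaultdict(list)
    for p in range(len(query) - block + 1):
        index[query[p:p + block]].append(p)

    # Stage 1 (filtration): keep pairs (i, j) that share an exact block
    # at the same offset inside their m-tuples.
    candidates = set()
    for i in range(len(text) - m + 1):
        for b in range(k + 1):
            off = b * block
            for p in index.get(text[i + off:i + off + block], ()):
                j = p - off
                if 0 <= j <= len(query) - m:
                    candidates.add((i, j))

    # Stage 2 (verification): exact mismatch count on the survivors only.
    return sorted(
        (i, j)
        for (i, j) in candidates
        if hamming(text[i:i + m], query[j:j + m]) <= k
    )


if __name__ == "__main__":
    text, query = "ACGTACGTTT", "ACGAACG"
    print(filter_and_verify(text, query, m=4, k=1))
```

Because every true match must pass the pigeonhole filter, the two-stage function returns the same pairs as the brute-force comparison; the filter only reduces how many pairs reach the exact comparison. When q = m the query contributes a single m-tuple, which gives the classical k-mismatches setting mentioned above.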

The research was supported in part by the National Science Foundation (DMS 90-05833) and the National Institutes of Health (GM-36230). This paper was written when P.A.P. was at the Department of Mathematics, University of Southern California.

Editor information

Alberto Apostolico, Maxime Crochemore, Zvi Galil, Udi Manber

Copyright information

© 1993 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pevzner, P.A., Waterman, M.S. (1993). A fast filtration algorithm for the substring matching problem. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds) Combinatorial Pattern Matching. CPM 1993. Lecture Notes in Computer Science, vol 684. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0029806

  • DOI: https://doi.org/10.1007/BFb0029806

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-56764-6

  • Online ISBN: 978-3-540-47732-7

  • eBook Packages: Springer Book Archive
