Optimal Spaced Seeds for Faster Approximate String Matching

  • Conference paper
Automata, Languages and Programming (ICALP 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3580)

Abstract

Filtering is a standard technique for fast approximate string matching in practice. In filtering, a quick first step is used to rule out almost all positions of a text as possible starting positions for a pattern. Typically this step consists of finding the exact matches of small parts of the pattern. In the follow-up step, a slow method is used to verify or eliminate each remaining position. The running time of such a method depends largely on the quality of the filtering step, as measured by its false positive rate. The quality of such a method depends on the number of true matches that it misses, that is, on its false negative rate.
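The filter-then-verify scheme described above can be illustrated with a minimal sketch. It assumes Hamming distance (at most `k` mismatches) and uses the classical pigeonhole filter with contiguous pieces, which is a textbook stand-in and not the construction of this paper. The function names `filter_candidates` and `search` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_candidates(text, pattern, k):
    """Quick first step: keep only text positions where at least one of
    k + 1 disjoint pattern pieces occurs exactly at the right offset.
    By the pigeonhole principle, k mismatches cannot touch all k + 1
    pieces, so no true match is lost (no false negatives)."""
    m = len(pattern)
    size = m // (k + 1)
    if size == 0:
        raise ValueError("pattern too short for k + 1 non-empty pieces")
    candidates = set()
    for piece_index in range(k + 1):
        offset = piece_index * size
        piece = pattern[offset:offset + size]
        hit = text.find(piece)
        while hit != -1:
            start = hit - offset  # where the whole pattern would begin
            if 0 <= start <= len(text) - m:
                candidates.add(start)
            hit = text.find(piece, hit + 1)
    return candidates


def search(text, pattern, k):
    """Slow follow-up step: verify each surviving candidate position."""
    m = len(pattern)
    return sorted(
        p for p in filter_candidates(text, pattern, k)
        if hamming(text[p:p + m], pattern) <= k
    )
```

Candidates that fail verification are the filter's false positives; the fewer there are, the less time the slow step consumes.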

A spaced seed is a recently introduced type of filter pattern that allows gaps (i.e., don’t cares) in the small sub-pattern to be searched for. Spaced seeds promise to yield a much lower false positive rate, and thus have been extensively studied, though heretofore only heuristically or statistically.
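As a small illustration of the definition above, a spaced seed can be written as a string in which `1` marks a position that must match and `*` marks a don't care. This notation and the helper names `seed_positions`, `seed_matches`, and `seed_hits` are assumptions made for this sketch, not taken from the paper.

```python
def seed_positions(seed):
    """Indices of the 'care' positions ('1') in a seed such as '11*1'."""
    return [i for i, c in enumerate(seed) if c == "1"]


def seed_matches(seed, a, b):
    """True if windows a and b (each as long as the seed) agree on every
    care position; don't-care positions ('*') are ignored."""
    return all(a[i] == b[i] for i in seed_positions(seed))


def seed_hits(seed, text, window):
    """All text positions where the seed-shaped sub-pattern taken from
    `window` occurs in `text`."""
    span = len(seed)
    return [
        s for s in range(len(text) - span + 1)
        if seed_matches(seed, text[s:s + span], window)
    ]
```

For example, with the seed `11*1` the windows `ACGT` and `ACTT` count as a hit, because they differ only at the don't-care position.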

In this paper, we show how to optimally design spaced seeds that yield no false negatives.
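To make "no false negatives" concrete, the following brute-force sketch checks whether a single seed is lossless for patterns of length `m` with at most `k` mismatches: every placement of `k` mismatches must leave at least one seed alignment whose care positions are all matches. It is an exhaustive check for small parameters only, not the design method of the paper, and `is_lossless` is a hypothetical name.

```python
from itertools import combinations


def is_lossless(seed, m, k):
    """Return True if, for every set of k mismatch positions in a
    length-m alignment, some placement of the seed avoids all of them
    on its care positions ('1'). Checking exactly k mismatches is
    enough, since removing a mismatch can only help the seed."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        survives = any(
            all(start + i not in bad for i in care)
            for start in range(m - span + 1)
        )
        if not survives:
            return False
    return True
```

With `m = 8` and `k = 1`, the contiguous seed `1111` passes this check, while `11111` fails because a single mismatch near the middle breaks every window of five consecutive positions.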

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Farach-Colton, M., Landau, G.M., Sahinalp, S.C., Tsur, D. (2005). Optimal Spaced Seeds for Faster Approximate String Matching. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds) Automata, Languages and Programming. ICALP 2005. Lecture Notes in Computer Science, vol 3580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11523468_101

  • DOI: https://doi.org/10.1007/11523468_101

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-27580-0

  • Online ISBN: 978-3-540-31691-6

  • eBook Packages: Computer Science, Computer Science (R0)
