Optimal Spaced Seeds for Faster Approximate String Matching

Farach-Colton, Martin; Landau, Gad M.; Sahinalp, S. Cenk; Tsur, Dekel

doi:10.1007/11523468_101

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3580)

Included in the following conference series:

International Colloquium on Automata, Languages, and Programming

Abstract

Filtering is a standard technique for fast approximate string matching in practice. In filtering, a quick first step is used to rule out almost all positions of a text as possible starting positions for a pattern. Typically this step consists of finding the exact matches of small parts of the pattern. In the follow-up step, a slow method is used to verify or eliminate each remaining position. The running time of such a method depends largely on the quality of the filtering step, as measured by its false-positive rate. The quality of such a method depends on the number of true matches that it misses, that is, on its false-negative rate.
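
The filter-then-verify scheme described above can be illustrated with a minimal Python sketch. It is not the method of this paper: it assumes a mismatch-only (Hamming distance) model and uses the classical pigeonhole filter, in which the pattern is cut into k + 1 contiguous pieces and any occurrence with at most k mismatches must contain at least one piece exactly. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_candidates(text, pattern, k):
    """Fast step: keep only start positions where some pattern piece matches exactly.

    Assumption: Hamming-distance matching. With k + 1 disjoint pieces and at most
    k mismatches, at least one piece is mismatch-free, so no true match is lost.
    """
    m = len(pattern)
    if m < k + 1:
        raise ValueError("pattern must have at least k + 1 characters")
    piece = m // (k + 1)
    candidates = set()
    for i in range(k + 1):
        start = i * piece
        end = m if i == k else start + piece  # last piece absorbs the remainder
        fragment = pattern[start:end]
        pos = text.find(fragment)
        while pos != -1:
            cand = pos - start  # implied start of the whole pattern
            if 0 <= cand <= len(text) - m:
                candidates.add(cand)
            pos = text.find(fragment, pos + 1)
    return candidates


def approximate_search(text, pattern, k):
    """Slow step: verify each surviving candidate by a full comparison."""
    m = len(pattern)
    return sorted(p for p in filter_candidates(text, pattern, k)
                  if hamming(text[p:p + m], pattern) <= k)
```

In this sketch the false positives are the candidates that fail verification, so they cost time but not correctness, while the pigeonhole argument guarantees a false-negative rate of zero.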

A spaced seed is a recently introduced type of filter pattern that allows gaps (i.e., don’t cares) in the small sub-pattern to be searched for. Spaced seeds promise to yield a much lower false-positive rate, and thus have been extensively studied, though heretofore only heuristically or statistically.
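
As a rough illustration of the definition above, the following Python sketch treats a spaced seed as a string over '1' (position must match) and '0' (don't care). That encoding and the function names are assumptions made for this example and are not taken from the paper.

```python
def seed_agrees(seed, u, v):
    """True if u and v agree at every '1' position of the seed ('0' = don't care)."""
    return all(x == y for s, x, y in zip(seed, u, v) if s == "1")


def seed_candidates(seed, text, pattern):
    """Start positions in text that pass the seed filter at some pattern offset."""
    w, m, n = len(seed), len(pattern), len(text)
    hits = []
    for p in range(n - m + 1):
        window = text[p:p + m]
        # The candidate survives if the seed matches at any aligned offset j.
        if any(seed_agrees(seed, pattern[j:j + w], window[j:j + w])
               for j in range(m - w + 1)):
            hits.append(p)
    return hits
```

For example, with the seed "101" the windows "AXC" and "AYC" agree, because the middle position is a don't care, whereas the contiguous seed "111" rejects the pair.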

In this paper, we show how to optimally design spaced seeds that yield no false negatives.
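
To make the "no false negatives" requirement concrete, here is a brute-force Python sketch that checks whether a given seed is lossless for a pattern of length m with exactly k mismatches (which also covers fewer than k). It is only an exhaustive check under the same assumed '1'/'0' seed encoding and a mismatch-only model; it is exponential in k and is not the design procedure of the paper, which is not visible in this preview.

```python
from itertools import combinations


def is_lossless(seed, m, k):
    """True if every placement of k mismatches among m positions leaves some
    alignment of the seed whose '1' positions are all mismatch-free."""
    if not 0 <= k <= m:
        raise ValueError("k must satisfy 0 <= k <= m")
    w = len(seed)
    ones = [i for i, c in enumerate(seed) if c == "1"]
    # Positions inspected by the seed at each possible offset in the pattern.
    placements = [frozenset(j + i for i in ones) for j in range(m - w + 1)]
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        if not any(bad.isdisjoint(inspected) for inspected in placements):
            return False  # this mismatch set defeats every seed placement
    return True
```

Under these assumptions, is_lossless("101", 4, 1) holds while is_lossless("111", 4, 1) does not, since a single mismatch at position 1 hits both placements of the contiguous seed.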

Author information

Authors and Affiliations

Dept. of Computer Science and DIMACS, Rutgers University
Martin Farach-Colton
Dept. of Computer Science, University of Haifa
Gad M. Landau
School of Computing Science, Simon Fraser University
S. Cenk Sahinalp
Dept. of Computer Science and Engineering, University of California, San Diego
Dekel Tsur

Editor information

Editors and Affiliations

CITI / Departamento de Informática, Universidade Nova de Lisboa, Portugal
Luís Caires
Dipartimento di Informatica, Sistemi e Produzione, Università di Roma “Tor Vergata”, via del Politecnico 1, 00133, Roma, Italy
Giuseppe F. Italiano
Departamento de Informatica, Universidade Nova de Lisboa, 2829-516, Caparica, Portugal
Luís Monteiro
Ecole Polytechnique, Rue de Saclay, 91128, Palaiseau Cedex, France
Catuscia Palamidessi
Computer Science Department, Google Inc. and Columbia University, 1214 Amsterdam Avenue, NY 10027, New York, USA
Moti Yung

Cite this paper

Farach-Colton, M., Landau, G.M., Sahinalp, S.C., Tsur, D. (2005). Optimal Spaced Seeds for Faster Approximate String Matching. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds) Automata, Languages and Programming. ICALP 2005. Lecture Notes in Computer Science, vol 3580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11523468_101

DOI: https://doi.org/10.1007/11523468_101
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27580-0
Online ISBN: 978-3-540-31691-6
eBook Packages: Computer Science, Computer Science (R0)
