Abstract
Filtering is a standard technique for fast approximate string matching in practice. In filtering, a quick first step is used to rule out almost all positions of a text as possible starting positions for a pattern. Typically this step consists of finding the exact matches of small parts of the pattern. In the follow-up step, a slow method is used to verify or eliminate each remaining position. The running time of such a method depends largely on the quality of the filtering step, as measured by its false positive rate. The quality of such a method depends on the number of true matches that it misses, that is, on its false negative rate.
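The two-step scheme described above can be illustrated with a minimal sketch. It is not taken from the paper: it assumes Hamming distance (at most k mismatches) and uses the classical pigeonhole partition of the pattern into k + 1 contiguous pieces as the filter. The names `filter_and_verify` and `hamming` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_and_verify(text, pattern, k):
    """Return (matches, number_of_candidates) for pattern in text with <= k mismatches.

    Filter step (assumption: pigeonhole partition): split the pattern into
    k + 1 contiguous pieces. An occurrence with at most k mismatches must
    contain at least one piece exactly, so this filter misses no true match.
    """
    m = len(pattern)
    if m <= k:
        raise ValueError("pattern must be longer than k")
    # Piece boundaries; strictly increasing because m >= k + 1.
    bounds = [i * m // (k + 1) for i in range(k + 2)]

    candidates = set()
    for lo, hi in zip(bounds, bounds[1:]):
        piece = pattern[lo:hi]
        start = text.find(piece)
        while start != -1:
            pos = start - lo  # implied starting position of the whole pattern
            if 0 <= pos <= len(text) - m:
                candidates.add(pos)
            start = text.find(piece, start + 1)

    # Verification step: the slow check, run only on surviving positions.
    matches = sorted(p for p in candidates
                     if hamming(text[p:p + m], pattern) <= k)
    return matches, len(candidates)


matches, n_candidates = filter_and_verify("xxabcdxxabzdxx", "abcd", 1)
print(matches, n_candidates)  # [2, 8] 2
```

In this sketch every candidate that fails verification is a false positive of the filter, and the pigeonhole argument is what guarantees a false negative rate of zero.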
A spaced seed is a recently introduced type of filter pattern that allows gaps (i.e., don't cares) in the small sub-pattern to be searched for. Spaced seeds promise to yield a much lower false positive rate, and thus have been extensively studied, though heretofore only heuristically or statistically.
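As a small illustration of what a spaced seed hit means, the sketch below writes a seed as a string over `1` (position must match) and `0` (don't care). That notation and the function names `seed_hit` and `seed_hits` are assumptions made for this example, not definitions from the paper.

```python
def seed_hit(a, b, seed):
    """True if windows a and b agree on every '1' position of the seed."""
    return all(x == y for x, y, s in zip(a, b, seed) if s == "1")


def seed_hits(text, pattern, seed):
    """All (text_offset, pattern_offset) pairs at which the seed hits."""
    span = len(seed)
    return [(i, j)
            for i in range(len(text) - span + 1)
            for j in range(len(pattern) - span + 1)
            if seed_hit(text[i:i + span], pattern[j:j + span], seed)]


# The mismatch at the third position falls on a don't-care of "1101",
# so the spaced seed hits where the contiguous seed "1111" does not.
print(seed_hit("ACGT", "ACTT", "1101"))     # True
print(seed_hit("ACGT", "ACTT", "1111"))     # False
print(seed_hits("GACGTA", "ACTT", "1101"))  # [(1, 0)]
```

This quadratic scan is only meant to make the definition concrete; a practical filter would index the seed-masked substrings instead.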
In this paper, we show how to optimally design spaced seeds that yield no false negatives.
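To make "no false negatives" concrete, the following sketch checks by brute force whether a given seed is lossless for patterns of length m with at most k mismatches: for every placement of k mismatches there must be some offset at which all `1` positions of the seed avoid them. This exhaustive check is an illustration only and is not the design method of the paper; the `1`/`0` seed notation and the name `is_lossless` are assumptions.

```python
from itertools import combinations


def is_lossless(seed, m, k):
    """Brute-force test that `seed` hits every length-m alignment with k mismatches.

    Checking exactly k mismatches is enough: removing a mismatch can only
    make a hit easier. The running time is exponential in k.
    """
    ones = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    if span > m:
        return False
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        # The seed hits at `off` if none of its '1' positions is a mismatch.
        if not any(all(off + i not in bad for i in ones)
                   for off in range(m - span + 1)):
            return False
    return True


print(is_lossless("111", 6, 1))   # True
print(is_lossless("111", 5, 1))   # False: a mismatch in the middle blocks every window
print(is_lossless("1101", 6, 1))  # True
```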
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Farach-Colton, M., Landau, G.M., Sahinalp, S.C., Tsur, D. (2005). Optimal Spaced Seeds for Faster Approximate String Matching. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds) Automata, Languages and Programming. ICALP 2005. Lecture Notes in Computer Science, vol 3580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11523468_101
DOI: https://doi.org/10.1007/11523468_101
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27580-0
Online ISBN: 978-3-540-31691-6
eBook Packages: Computer Science (R0)