Approximate All-Pairs Suffix/Prefix Overlaps

Välimäki, Niko; Ladra, Susana; Mäkinen, Veli

doi:10.1007/978-3-642-13509-5_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6129)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of r strings of total length n and an error rate ε, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k = ⌈εℓ⌉, where ℓ is the length of the overlap. We propose new solutions for this problem based on backward backtracking (Lam et al., 2008) and suffix filters (Kärkkäinen and Na, 2008). The techniques use nH_k + o(n log σ) + r log r bits of space, where H_k is the k-th order entropy and σ is the alphabet size. In practice, the methods are easy to parallelize and scale up to millions of DNA reads.
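
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the method proposed in the paper, which relies on backward backtracking and suffix filters over a compressed index; it only enumerates, by plain dynamic programming, the overlaps that the definition admits. The function names, the `min_len` threshold, and the choice to measure ℓ on the prefix side of the overlap are assumptions made for illustration. The error rate is passed as a `Fraction` so that ⌈εℓ⌉ is computed exactly.

```python
from fractions import Fraction
from math import ceil


def overlap_row(a: str, b: str) -> list[int]:
    """Return row where row[j] is the minimum edit distance between the
    prefix b[:j] and any suffix of a (including the empty suffix)."""
    # Empty suffix of a against b[:j]: j insertions.
    prev = list(range(len(b) + 1))
    for ch in a:
        # A suffix may start anywhere in a, so matching the empty
        # prefix of b always costs 0.
        cur = [0]
        for j in range(1, len(b) + 1):
            cur.append(min(
                prev[j] + 1,                      # skip a character of a
                cur[j - 1] + 1,                   # skip a character of b
                prev[j - 1] + (ch != b[j - 1]),   # match or substitute
            ))
        prev = cur
    return prev


def approximate_overlaps(reads: list[str], eps: Fraction, min_len: int):
    """List (i, j, ell, dist) for every ordered pair of distinct reads where
    some suffix of reads[i] is within edit distance ceil(eps * ell) of the
    prefix of reads[j] of length ell >= min_len.

    Assumption: ell is taken as the prefix length; min_len is a hypothetical
    threshold that suppresses trivially short overlaps."""
    hits = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            row = overlap_row(a, b)
            for ell in range(min_len, len(b) + 1):
                if row[ell] <= ceil(eps * ell):
                    hits.append((i, j, ell, row[ell]))
    return hits


if __name__ == "__main__":
    # Exact overlaps only (eps = 0): the suffix "CGT" of the first read
    # equals the prefix "CGT" of the second.
    print(approximate_overlaps(["AAAACGT", "CGTTTTT"], Fraction(0), 3))
```

This sketch takes time proportional to the product of the two read lengths for every ordered pair, which is what makes index-based techniques attractive once the number of reads grows into the millions.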

References

Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Roche Company. 454 life sciences, http://www.454.com/
Simpson, J.T., et al.: ABySS: A parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009)
Morin, R.D., et al.: Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. BioTechniques 45(1), 81–94 (2008)
Li, R., et al.: SOAP2. Bioinformatics 25(15), 1966–1967 (2009)
Wicker, T., et al.: 454 sequencing put to the test using the complex genome of barley. BMC Genomics 7(1), 275 (2006)
Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3(2), article 20 (2007)
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Hyyrö, H., Navarro, G.: Bit-parallel witnesses and their applications to approximate string matching. Algorithmica 41(3), 203–231 (2005)
Kärkkäinen, J., Na, J.C.: Faster filters for approximate string matching. In: Proc. ALENEX 2007, pp. 84–90. SIAM, Philadelphia (2007)
Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13, 7–51 (1995)
Lam, T.W., Sung, W.K., Tam, S.L., Wong, C.K., Yiu, S.M.: Compressed indexing and local alignment of DNA. Bioinformatics 24(6), 791–797 (2008)
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3), R25 (2009)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009), Advance access
Mäkinen, V., Välimäki, N., Laaksonen, A., Katainen, R.: Unifying view of backward backtracking in short read mapping. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) LNCS Festschrifts. Springer, Heidelberg (to appear 2010)
Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3) (2008)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395–415 (1999)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surveys 33(1), 31–88 (2001)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)
Pevzner, P., Tang, H., Waterman, M.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 98(17), 9748–9753 (2001)
Pop, M., Salzberg, S.L.: Bioinformatics challenges of new sequencing technology. Trends Genet. 24, 142–149 (2008)
Salmela, L.: Personal communication (2010)
Sellers, P.: The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms 1(4), 359–373 (1980)
Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10(1), 57–63 (2009)
Weiner, P.: Linear pattern matching algorithm. In: Proc. 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18(5), 821–829 (2008)

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Finland
Niko Välimäki & Veli Mäkinen
Department of Computer Science, University of A Coruña, Spain
Susana Ladra

Editor information

Editors and Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA, and Bar-Ilan University, 52900, Ramat-Gan, Israel
Amihood Amir
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Laxmi Parida

Cite this paper

Välimäki, N., Ladra, S., Mäkinen, V. (2010). Approximate All-Pairs Suffix/Prefix Overlaps. In: Amir, A., Parida, L. (eds) Combinatorial Pattern Matching. CPM 2010. Lecture Notes in Computer Science, vol 6129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13509-5_8

DOI: https://doi.org/10.1007/978-3-642-13509-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13508-8
Online ISBN: 978-3-642-13509-5
eBook Packages: Computer Science, Computer Science (R0)
