Approximate All-Pairs Suffix/Prefix Overlaps

  • Conference paper
Combinatorial Pattern Matching (CPM 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6129)

Abstract

Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of r strings of total length n and an error rate ε, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k = ⌈εℓ⌉, where ℓ is the length of the overlap. We propose new solutions for this problem based on backward backtracking (Lam et al., 2008) and suffix filters (Kärkkäinen and Na, 2008). The techniques use nH_k + o(n log σ) + r log r bits of space, where H_k is the k-th order entropy and σ is the alphabet size. In practice, the methods are easy to parallelize and scale up to millions of DNA reads.
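
To make the problem statement concrete, the following is a minimal sketch of a naive dynamic-programming baseline for approximate suffix/prefix overlaps. It is not the method proposed in the paper, which relies on backward backtracking over a compressed index and on suffix filters. The function names `suffix_prefix_overlaps` and `all_pairs_overlaps`, the `min_len` parameter, and the choice to measure the overlap length ℓ on the prefix side are assumptions made for illustration only.

```python
import math


def suffix_prefix_overlaps(a, b, eps, min_len=1):
    """Return (ell, dist) pairs such that the prefix b[:ell] is within
    ceil(eps * ell) edits of some suffix of a.

    Assumption: ell is measured as the length of the prefix of b.
    """
    # prev[j] = edit distance between the empty suffix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        # cur[0] = 0: the matched suffix of a may start at any position
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # delete a[i-1]
                         cur[j - 1] + 1)      # insert b[j-1]
        prev = cur
    # After the last row, prev[j] is the minimum edit distance between
    # b[:j] and any suffix of a (the alignment must end at the end of a).
    result = []
    for ell in range(min_len, len(b) + 1):
        # Small tolerance guards against floating-point noise in eps * ell.
        k = math.ceil(eps * ell - 1e-9)
        if prev[ell] <= k:
            result.append((ell, prev[ell]))
    return result


def all_pairs_overlaps(reads, eps, min_len=1):
    """Apply the pairwise check to every ordered pair of distinct reads."""
    found = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                hits = suffix_prefix_overlaps(a, b, eps, min_len)
                if hits:
                    found[(i, j)] = hits
    return found
```

This baseline takes time proportional to |a|·|b| for each ordered pair, so it is only practical for small inputs; avoiding that quadratic pairwise cost is the motivation for index-based approaches such as those the abstract describes.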

References

  1. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

  2. Roche Company. 454 life sciences, http://www.454.com/

  3. Simpson, J.T., et al.: ABySS: A parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009)

  4. Morin, R.D., et al.: Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. BioTechniques 45(1), 81–94 (2008)

  5. Li, R., et al.: SOAP2. Bioinformatics 25(15), 1966–1967 (2009)

  6. Wicker, T., et al.: 454 sequencing put to the test using the complex genome of barley. BMC Genomics 7(1), 275 (2006)

  7. Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)

  8. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3(2), article 20 (2007)

  9. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

  10. Hyyrö, H., Navarro, G.: Bit-parallel witnesses and their applications to approximate string matching. Algorithmica 41(3), 203–231 (2005)

  11. Kärkkäinen, J., Na, J.C.: Faster filters for approximate string matching. In: Proc. ALENEX 2007, pp. 84–90. SIAM, Philadelphia (2007)

  12. Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13, 7–51 (1995)

  13. Lam, T.W., Sung, W.K., Tam, S.L., Wong, C.K., Yiu, S.M.: Compressed indexing and local alignment of DNA. Bioinformatics 24(6), 791–797 (2008)

  14. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3), R25 (2009)

  15. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)

  16. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (2009), Advance access

  17. Mäkinen, V., Välimäki, N., Laaksonen, A., Katainen, R.: Unifying view of backward backtracking in short read mapping. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) LNCS Festschrifts. Springer, Heidelberg (to appear 2010)

  18. Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3) (2008)

  19. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)

  20. Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395–415 (1999)

  21. Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surveys 33(1), 31–88 (2001)

  22. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)

  23. Pevzner, P., Tang, H., Waterman, M.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 98(17), 9748–9753 (2001)

  24. Pop, M., Salzberg, S.L.: Bioinformatics challenges of new sequencing technology. Trends Genet. 24, 142–149 (2008)

  25. Salmela, L.: Personal communication (2010)

  26. Sellers, P.: The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms 1(4), 359–373 (1980)

  27. Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10(1), 57–63 (2009)

  28. Weiner, P.: Linear pattern matching algorithm. In: Proc. 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)

  29. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18(5), 821–829 (2008)

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Välimäki, N., Ladra, S., Mäkinen, V. (2010). Approximate All-Pairs Suffix/Prefix Overlaps. In: Amir, A., Parida, L. (eds) Combinatorial Pattern Matching. CPM 2010. Lecture Notes in Computer Science, vol 6129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13509-5_8

  • DOI: https://doi.org/10.1007/978-3-642-13509-5_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13508-8

  • Online ISBN: 978-3-642-13509-5

  • eBook Packages: Computer Science, Computer Science (R0)
