Gap Filling as Exact Path Length Problem

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2015)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9029)

Abstract

One of the last steps in a genome assembly project is filling the gaps between consecutive contigs in the scaffolds. This problem can be naturally stated as finding an \(s\)-\(t\) path in a directed graph whose sum of arc costs belongs to a given range (the estimate of the gap length), where \(s\) and \(t\) are the two contigs flanking a gap. The problem is known to be NP-hard in general. Here we derive a dynamic programming solution that is simpler than those previously known and pseudo-polynomial in the maximum value of the input range. We implemented various practical optimizations of it and compared our exact gap filling solution experimentally to popular gap filling tools. Summed over all the bacterial assemblies considered in our experiments, our method fills 28% more gaps than the best previous tool, and the gaps it fills span 80% more sequence. Furthermore, the error level of the newly introduced sequence is comparable to that of the previous tools.
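To make the problem statement concrete, the following is a minimal sketch of a textbook pseudo-polynomial dynamic program for it, not the authors' implementation or any of their optimizations. It assumes positive integer arc costs and allows vertices to repeat along a path; the function name `find_path_in_range` and the toy graph in the usage note are hypothetical. The table has one row per length up to the maximum of the range, so the running time is proportional to that maximum times the number of arcs.

```python
from collections import defaultdict


def find_path_in_range(arcs, s, t, d_min, d_max):
    """Return (length, path) for an s-t path with d_min <= length <= d_max, or None.

    arcs: iterable of (u, v, cost) triples; costs are assumed to be positive
    integers. Vertices may repeat on the returned path (assumption of this sketch).
    """
    out = defaultdict(list)
    for u, v, c in arcs:
        if c <= 0:
            raise ValueError("arc costs must be positive integers")
        out[u].append((v, c))

    # parent[l][v] records one predecessor (u, l - cost) of v on a path
    # from s to v of total cost exactly l.
    parent = [dict() for _ in range(d_max + 1)]
    parent[0][s] = None

    # Costs are positive, so row l is complete before it is expanded.
    for l in range(d_max + 1):
        for u in list(parent[l]):
            for v, c in out[u]:
                nl = l + c
                if nl <= d_max and v not in parent[nl]:
                    parent[nl][v] = (u, l)

    # Report the shortest admissible length and trace the path back to s.
    for l in range(max(d_min, 0), d_max + 1):
        if t in parent[l]:
            path = [t]
            step = parent[l][t]
            while step is not None:
                u, prev_len = step
                path.append(u)
                step = parent[prev_len][u]
            path.reverse()
            return l, path
    return None
```

For example, with arcs `[("s", "a", 2), ("a", "t", 3), ("s", "t", 1), ("a", "a", 1)]`, the range [5, 5] is met by the path s, a, t, while no s-t path has a length in [2, 4].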

Author information

Corresponding author

Correspondence to Leena Salmela.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Salmela, L., Sahlin, K., Mäkinen, V., Tomescu, A.I. (2015). Gap Filling as Exact Path Length Problem. In: Przytycka, T. (ed.) Research in Computational Molecular Biology. RECOMB 2015. Lecture Notes in Computer Science, vol 9029. Springer, Cham. https://doi.org/10.1007/978-3-319-16706-0_29

  • DOI: https://doi.org/10.1007/978-3-319-16706-0_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16705-3

  • Online ISBN: 978-3-319-16706-0

  • eBook Packages: Computer Science, Computer Science (R0)
