Abstract
One of the last steps in a genome assembly project is filling the gaps between consecutive contigs in the scaffolds. This problem can be naturally stated as finding an \(s\)-\(t\) path in a directed graph whose sum of arc costs belongs to a given range (the estimate of the gap length). Here \(s\) and \(t\) are any two contigs flanking a gap. This problem is known to be NP-hard in general. We derive a dynamic programming solution that is simpler than those already known and is pseudo-polynomial in the maximum value of the input range. We implemented various practical optimizations for it and compared our exact gap filling solution experimentally to popular gap filling tools. Summed over all the bacterial assemblies considered in our experiments, our method fills 28% more gaps than the best previous tool, and the gaps it fills span 80% more sequence. Furthermore, the error level of the newly introduced sequence is comparable to that of the previous tools.
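To make the problem statement concrete, the following is a minimal Python sketch of a pseudo-polynomial dynamic program for finding an \(s\)-\(t\) path whose total arc cost falls in a given range. It is an illustration under stated assumptions, not the authors' implementation: arc costs are assumed to be positive integers, the path is allowed to revisit vertices, and the function name `exact_length_path` and its parameters are hypothetical. The table has one row per cost value up to the upper bound of the range, so the running time of this sketch is proportional to that bound times the number of arcs.

```python
from collections import defaultdict


def exact_length_path(arcs, s, t, d_min, d_max):
    """Return (cost, path) for some s-t path with d_min <= cost <= d_max,
    or None if no such path exists.

    arcs: iterable of (u, v, cost) triples. Costs are assumed to be
    positive integers (an assumption of this sketch).
    The path may visit a vertex more than once.
    """
    if d_max < 0 or d_min > d_max:
        return None

    # Group arcs by their head vertex so each DP cell scans its in-arcs.
    incoming = defaultdict(list)
    for u, v, c in arcs:
        if c <= 0:
            raise ValueError("arc costs must be positive integers")
        incoming[v].append((u, c))

    # pred[l] maps v -> (u, c) when some s-v path of cost exactly l
    # ends with the arc (u, v) of cost c. The source at cost 0 maps to None.
    pred = [dict() for _ in range(d_max + 1)]
    pred[0][s] = None

    for l in range(1, d_max + 1):
        for v, in_arcs in incoming.items():
            for u, c in in_arcs:
                if c <= l and u in pred[l - c]:
                    pred[l][v] = (u, c)
                    break  # one witness per cell is enough

    # Report the smallest feasible cost in the range and trace the path back.
    for l in range(max(d_min, 0), d_max + 1):
        if t in pred[l]:
            path, v, rem = [t], t, l
            while pred[rem][v] is not None:
                u, c = pred[rem][v]
                path.append(u)
                v, rem = u, rem - c
            path.reverse()
            return l, path
    return None
```

In this sketch, a call such as `exact_length_path(arcs, "s", "t", 90, 110)` would look for a path between two flanking vertices whose cost lies within an estimated gap length interval of 90 to 110; the vertex labels and the interval are made up for illustration.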
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Salmela, L., Sahlin, K., Mäkinen, V., Tomescu, A.I. (2015). Gap Filling as Exact Path Length Problem. In: Przytycka, T. (eds) Research in Computational Molecular Biology. RECOMB 2015. Lecture Notes in Computer Science, vol 9029. Springer, Cham. https://doi.org/10.1007/978-3-319-16706-0_29
DOI: https://doi.org/10.1007/978-3-319-16706-0_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16705-3
Online ISBN: 978-3-319-16706-0
eBook Packages: Computer Science, Computer Science (R0)