DOI: 10.1145/2649387.2660821
research-article

Big data challenges for estimating genome assembler quality

Published: 20 September 2014

Abstract

Selecting an appropriate assembler is important for obtaining the best assembly of a fragment dataset, avoiding misassemblies, and minimizing further finishing effort. Assembly quality is known to depend on input data parameters such as DNA fragmentation parameters and genome sequence structure. To the best of our knowledge, no large-scale systematic effort has been made to quantify the quality of the assemblies generated by various assemblers over a range of input parameters. The correlation between input parameters and assembly quality can be used to characterize an assembler and to design an optimal assembler selection algorithm. The critical barrier is computational: simulated high-throughput sequence libraries must be assembled for thousands of genomes, with input parameters varied to cover the spectrum of values obtained from the major sequencers available to biologists today. We present a study showing that, for four major open-source assemblers, a quantifiable correlation can be drawn between input and output characteristics. Based on our results, we propose a simple model to estimate the quality of the assemblies these assemblers generate for given input parameters.
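
As a rough illustration of the kind of analysis the abstract describes, the sketch below computes a Pearson correlation and a least-squares line between one input parameter and one quality score, then uses the line to estimate quality for an unseen input. The choice of read coverage as the parameter, the 0-to-1 quality scale, the function names, and every numeric value are hypothetical placeholders, not taken from the paper; the paper's actual model is not reproduced here.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def estimate_quality(slope, intercept, x):
    """Estimate the quality score for an input parameter value x."""
    return slope * x + intercept


# Hypothetical toy data (NOT from the paper): read coverage of four
# simulated libraries and a made-up quality score for each assembly.
coverage = [10, 20, 30, 40]
quality = [0.2, 0.4, 0.6, 0.8]

r = pearson(coverage, quality)
slope, intercept = fit_line(coverage, quality)
estimate = estimate_quality(slope, intercept, 25)
print(f"r={r:.3f} slope={slope:.3f} intercept={intercept:.3f} estimate={estimate:.3f}")
```

A real study would fit one such relationship per assembler and per input parameter, and a linear form is only one possible assumption about how quality varies with the input.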

Cited By

  • (2019) Ion Torrent and lllumina, two complementary RNA-seq platforms for constructing the holm oak (Quercus ilex) transcriptome. PLOS ONE 14:1 (e0210356). DOI: 10.1371/journal.pone.0210356. Online publication date: 16-Jan-2019

Published In

BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
September 2014
851 pages
ISBN:9781450328944
DOI:10.1145/2649387
General Chairs: Pierre Baldi, Wei Wang
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. assembler characteristics
  2. assembly quality model
  3. big data
  4. genome fragmentation parameters

Conference

BCB '14: ACM-BCB '14
September 20 - 23, 2014
Newport Beach, California

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%
