Elsevier

Advances in Computers

Volume 60, 2004, Pages 193-248

Shotgun Sequence Assembly

https://doi.org/10.1016/S0065-2458(03)60006-9

Abstract

Shotgun sequencing is the most widely used technique for determining the DNA sequence of organisms. It involves breaking up the DNA into many small pieces that can be read by automated sequencing machines, then piecing together the original genome using specialized software programs called assemblers. Due to the large amounts of data being generated and to the complex structure of most organisms' genomes, successful assembly programs rely on sophisticated algorithms based on knowledge from such diverse fields as statistics, graph theory, computer science, and computer engineering. Throughout this chapter we will describe the main computational challenges imposed by the shotgun sequencing method, and survey the most widely used assembly algorithms.

Introduction

In 1982 Fred Sanger developed a new technique called shotgun sequencing and proved its worth by sequencing the complete genome of the bacteriophage Lambda [1]. Sequencing techniques were able to “read” only several hundred nucleotides at a time; shotgun sequencing attempted to overcome this limitation by breaking up the DNA at random into pieces small enough to be read. The resulting pieces were then assembled based on the similarity between pieces derived from the same section of the original DNA molecule. The large amount of data produced by shotgun sequencing made it necessary to use computer programs to assist the assembly [2], [3]. Despite continued improvements in sequencing technology and the development of specialized assembly programs, it was unclear whether shotgun sequencing could be used to sequence genomes larger than those of viruses (typically 5000–100,000 nucleotides). For larger genomes it was thought that the complexity of the task would pose an insurmountable challenge to any computer program.

In 1995, however, researchers at The Institute for Genomic Research (TIGR) successfully used the shotgun sequencing technique to decipher the complete genome of the bacterium Haemophilus influenzae [4]. The sequencing of this 1.83 million base pair genome required the development of a specialized assembly program [5] as well as painstaking laboratory efforts to complete those regions that could not be correctly assembled by the software. The success of the Haemophilus project started a genomics revolution, with the number of genomes being sequenced every year increasing at an exponential rate. At the moment the genomes of more than 1000 viruses, 100 bacteria, and several eukaryotes have been completed, while multiple other projects are well on their way to completion. In parallel with the large amounts of genomic data becoming available, the genomic revolution led to the birth of a new field, bioinformatics, which brings together an eclectic mix of scientific fields such as computer science and engineering, mathematics, physics, chemistry, and biology.

Critics of the shotgun sequencing approach continued to question its applicability to large genomes despite obvious successes in sequencing bacterial genomes. They argued the technique would be impractical in the case of large eukaryotic genomes because repeats (stretches of DNA that occur in two or more copies within the genome) would hopelessly confuse any assembler [6]. The standard procedure for handling large genomes was a hierarchical approach that involved breaking up the DNA into large (50–150 kbp) pieces cloned in bacterial artificial chromosomes (BACs), and then sequencing each BAC through the shotgun method. Most such criticism was silenced in 2000 by the successful assembly at Celera of the genome of Drosophila melanogaster [7] from whole-genome shotgun (WGS) data. The assembly was performed with a new assembler [8] designed to handle the specific problems involved in assembling large complex genomes. The researchers from Celera went on to assemble the human genome using the same whole-genome shotgun sequencing technique [9]. Their results were published simultaneously with those from the International Human Genome Sequencing Consortium, which used the traditional hierarchical method [10]. Independent studies [11], [12] later showed that the two assemblies produced similar results and that many of the differences between them could be explained by the draft-level quality of the data. The applicability of the WGS method to large genomes was thus proven, though some continue to dispute the validity of Celera's results (some opinions on this topic are presented in [13], [14], [15], [16], [17]).

Celera's success, combined with the cost advantages of the WGS technique (Celera sequenced and assembled the human genome in a little over a year, while the international consortium's efforts had been going on for more than 10 years), renewed interest in the WGS method and led to the development of several WGS assembly programs: Arachne [18], [19] at the Whitehead Institute, Phusion [20] at the Sanger Institute, Atlas [21] at the Baylor Human Genome Sequencing Center, and Jazz [22] at the DOE Joint Genome Institute. Most current sequencing projects have opted for a WGS approach instead of the hierarchical approach. For example, the sequencing projects for the mouse [23], rat [24], dog [25], puffer fish [22], and sea squirt [26] all follow the WGS strategy.

The current issue of debate is the suitability of whole-genome shotgun sequencing as the starting point in the efforts to obtain the complete sequence of a genome. All sequencing strategies start by building a backbone, or rough draft, of the genome whose gaps need to be filled in through further laboratory experiments. It is still not clear which sequencing strategy will ultimately be the most efficient in obtaining the complete sequence of an organism, especially as none of the large eukaryotic projects has yet been finished, except for the 100 Mbp genome of the nematode Caenorhabditis elegans, finished in October 2002. (The genomes of Drosophila melanogaster and human are expected to be mostly finished before the end of 2003.)

Despite significant differences in the overall structuring of the sequencing process, all sequencing strategies rely on shotgun sequencing as a basic component. The reader is referred to [27], [28] for an in-depth discussion of current approaches to sequencing. The following sections describe the shotgun sequencing technique, with an emphasis on the algorithmic challenges imposed by this technique.

Section snippets

Shotgun Sequencing Overview

The process of shotgun sequencing starts by physically breaking up the DNA molecule into millions of random fragments. The fragments are then inserted into cloning vectors in order to amplify the DNA to levels needed by the sequencing reactions. Commonly used cloning vectors are plasmids (circular pieces

Assembly Paradigms

In its most general form the sequence assembly problem involves reconstructing the genome from the shotgun reads based on sequence similarity alone. This problem can be further decomposed into two problems: the mapping or layout problem, in which all reads need to be positioned correctly in the genome, and the consensus problem, in which the contiguous DNA sequence of the genome is computed. It can be easily seen that in this formulation the general problem is impossible to solve. For example,
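To make the consensus half of this decomposition concrete, here is a minimal Python sketch. It assumes the layout problem has already been solved, so each read comes with an offset into the genome, and it computes the consensus by a simple per-position majority vote. The function name `consensus`, the `(offset, read)` input format, and the use of `N` for uncovered positions are illustrative assumptions, not the method of any particular assembler.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus for reads that are already positioned.

    layout: iterable of (offset, read) pairs, where offset is the assumed
    0-based start of the read in the reconstructed sequence.
    Positions covered by no read are reported as 'N'.
    """
    columns = {}  # position -> Counter of the bases observed there
    for offset, read in layout:
        for i, base in enumerate(read):
            columns.setdefault(offset + i, Counter())[base] += 1
    if not columns:
        return ""
    length = max(columns) + 1
    return "".join(
        columns[pos].most_common(1)[0][0] if pos in columns else "N"
        for pos in range(length)
    )


if __name__ == "__main__":
    # The last read carries one simulated sequencing error (T instead of A).
    layout = [(0, "ACGTAC"), (3, "TACGGA"), (6, "GGATTC"), (2, "GTTCGG")]
    print(consensus(layout))
```

In this toy input the erroneous base is outvoted by the two other reads covering the same position, which is why redundant coverage of each position matters for the consensus step.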

Overlap Detection

The basic assumption of shotgun sequencing is that sequence similarity between two reads is an indication that the reads originate from the same section of the genome. All assembly algorithms must therefore identify similarities between reads. The specific algorithmic approaches to the task have evolved over the years, as increasingly complex sequencing projects were tackled through the shotgun method. The earliest algorithms involved either iteratively aligning each read to an
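The idea of detecting similarity between reads can be illustrated with a deliberately simplified Python sketch that looks only for exact suffix-prefix overlaps by brute force over all read pairs. Real reads contain sequencing errors, so practical assemblers rely on error-tolerant alignment and faster indexing schemes; the function names and the `min_len` threshold below are assumptions made for illustration.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads (quadratic in the number of reads).

    Returns a dict mapping (i, j) to the overlap length between the end of
    reads[i] and the start of reads[j]; pairs without an overlap are omitted.
    """
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    overlaps[(i, j)] = k
    return overlaps


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GGATTC"]
    print(all_overlaps(reads))
```

The `min_len` threshold is there because very short matches occur by chance between unrelated reads; the all-pairs loop is the part that stops scaling as projects grow.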

Exotic Assembly

Up to this point we have presented solutions to the most common problems related to shotgun sequence assembly. These algorithms contributed to the current genomic revolution, which has led to an exponentially increasing number of genomes being sequenced. This increase in the number and types of genomes being analyzed is uncovering new problems to be solved by assembly programs. In this section we briefly discuss a few of the current assembly challenges.

Conclusions

The assembly problem was repeatedly considered solved: first when efficient approximation algorithms for the shortest superstring problem became available, again when assembly software was able to routinely assemble entire bacterial genomes, and most recently when software became available that can assemble entire mammalian genomes in a relatively short time. Continued reductions in sequencing costs have led to a dramatic increase in the numbers of genomes being sequenced. A direct effect of this genomic
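As a small illustration of the shortest superstring view of assembly mentioned above, the following Python sketch implements the classic greedy heuristic: repeatedly merge the pair of fragments with the largest exact overlap until one string remains. It is a toy under stated assumptions (error-free reads, a single strand, hypothetical function names) and not a description of any production assembler. Note also that a shortest superstring tends to collapse repeated regions, which is one reason this formulation alone does not settle the assembly problem.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy approximation of a shortest common superstring of `reads`."""
    # Drop duplicate reads and reads fully contained in another read.
    frags = list(dict.fromkeys(reads))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        # Find the ordered pair with the largest overlap (first one wins ties).
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the shared bases only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


if __name__ == "__main__":
    print(greedy_superstring(["ACGTAC", "TACGGA", "GGATTC"]))
```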

Acknowledgements

I would like to thank Art Delcher, Adam Phillippy, and Steven Salzberg for their useful comments and continued support. This work was supported in part by the National Institutes of Health under grant R01-LM06845.

References (137)

  • X. Huang. An improved sequence assembly program. Genomics (1996)
  • R. Staden. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Res. (1982)
  • T.R. Gingeras. Computer programs for the assembly of DNA sequences. Nucleic Acids Res. (1979)
  • R.D. Fleischmann. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science (1995)
  • G.G. Sutton. TIGR assembler: A new tool for assembling large shotgun sequencing projects. Genome Science and Technology (1995)
  • P. Green. Against a whole-genome shotgun. Genome Res. (1997)
  • M.D. Adams. The genome sequence of Drosophila melanogaster. Science (2000)
  • E.W. Myers. A whole-genome assembly of Drosophila. Science (2000)
  • J.C. Venter. The sequence of the human genome. Science (2001)
  • I.H.G.S. Consortium. Initial sequencing and analysis of the human genome. Nature (2001)
  • C.A. Semple. Computational comparison of human genomic sequence assemblies for a region of chromosome 4. Genome Res. (2002)
  • J. Aach. Computational comparison of two draft sequences of the human genome. Nature (2001)
  • M.D. Adams. The independence of our genome assemblies. Proc. Natl. Acad. Sci. USA (2003)
  • R.H. Waterston et al. More on the sequencing of the human genome. Proc. Natl. Acad. Sci. USA (2003)
  • R.H. Waterston et al. On the sequencing of the human genome. Proc. Natl. Acad. Sci. USA (2002)
  • P. Green. Whole-genome disassembly. Proc. Natl. Acad. Sci. USA (2002)
  • E.W. Myers. On the sequencing and assembly of the human genome. Proc. Natl. Acad. Sci. USA (2002)
  • S. Batzoglou. ARACHNE: a whole-genome shotgun assembler. Genome Res. (2002)
  • D.B. Jaffe. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. (2003)
  • J.C. Mullikin et al. The Phusion assembler. Genome Res. (2003)
  • P. Havlak. The Atlas whole-genome assembler
  • S. Aparicio. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science (2002)
  • R.H. Waterston. Initial sequencing and comparative analysis of the mouse genome. Nature (2002)
  • Consortium R.g.s. Rat Genome Project
  • E.F. Kirkness. The dog genome: survey sequencing and comparative analysis. Science (2003)
  • P. Dehal. The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science (2002)
  • E.D. Green. Strategies for the systematic sequencing of complex genomes. Nat. Rev. Genet. (2001)
  • W.W. Cai. A clone-array pooled shotgun strategy for sequencing large genomes. Genome Res. (2001)
  • E. Czabarka. Algorithms for optimizing production DNA sequencing
  • S. Batzoglou. Sequencing a genome by walking with clone-end sequences: A mathematical analysis. Genome Res. (1999)
  • X. Li et al. Estimating the repeat structure and length of DNA sequences using ell-tuples. Genome Res. (2003)
  • R.F. Yeh. Predicting Progress in Shotgun Sequencing with Paired Ends (2002)
  • S.L. Chissoe. Representation of cloned genomic sequences in two sequencing vectors: correlation of DNA sequence and subclone distribution. Nucleic Acids Res. (1997)
  • K. Mullis. Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. Cold Spring Harb. Symp. Quant. Biol. (1986)
  • L.J. Burgart. Multiplex polymerase chain reaction. Mod. Pathol. (1992)
  • R. Beigel. An optimal procedure for gap closing in whole genome shotgun sequencing
  • N. Alon. Learning a hidden matching
  • R. Staden et al. Sequence assembly and finishing methods. Methods Biochem. Anal. (2001)
  • D. Gordon et al. Automated finishing with autofinish. Genome Res. (2001)
  • D. Gordon et al. Consed: A graphical tool for sequence finishing. Genome Res. (1998)